2025-05-07T20:22:35.2633365Z Current runner version: '2.323.0'
2025-05-07T20:22:35.2643219Z Runner name: 'i-06f3d8044a6f79407'
2025-05-07T20:22:35.2644634Z Machine name: 'ip-10-0-69-200'
2025-05-07T20:22:35.2648830Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:22:35.2651872Z Contents: read
2025-05-07T20:22:35.2652601Z Metadata: read
2025-05-07T20:22:35.2653339Z Packages: read
2025-05-07T20:22:35.2654066Z ##[endgroup]
2025-05-07T20:22:35.2657355Z Secret source: None
2025-05-07T20:22:35.2658402Z Prepare workflow directory
2025-05-07T20:22:35.3205801Z Prepare all required actions
2025-05-07T20:22:35.3241711Z Getting action download info
2025-05-07T20:22:35.5052362Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:22:35.7991376Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:22:36.1737532Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:22:37.7857766Z Getting action download info
2025-05-07T20:22:37.9220447Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:22:38.1484126Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.10, 12.6.3, 12.6.3, clang)
2025-05-07T20:22:38.2033583Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:22:38.2150737Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:22:38.2163085Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:22:38.2164309Z ##[endgroup]
2025-05-07T20:22:39.3996613Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:22:39.3997102Z Instance Type: g5.4xlarge
2025-05-07T20:22:39.3997443Z AMI Name: unknown
2025-05-07T20:22:39.4037450Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:22:44.7435026Z ##[group]Run actions/checkout@v4
2025-05-07T20:22:44.7435341Z with:
2025-05-07T20:22:44.7435567Z   submodules: true
2025-05-07T20:22:44.7435806Z   repository: pytorch/FBGEMM
2025-05-07T20:22:44.7436207Z   token: ***
2025-05-07T20:22:44.7436412Z   ssh-strict: true
2025-05-07T20:22:44.7436625Z   ssh-user: git
2025-05-07T20:22:44.7436857Z   persist-credentials: true
2025-05-07T20:22:44.7437108Z   clean: true
2025-05-07T20:22:44.7437343Z   sparse-checkout-cone-mode: true
2025-05-07T20:22:44.7437623Z   fetch-depth: 1
2025-05-07T20:22:44.7437882Z   fetch-tags: false
2025-05-07T20:22:44.7438103Z   show-progress: true
2025-05-07T20:22:44.7438332Z   lfs: false
2025-05-07T20:22:44.7438543Z   set-safe-directory: true
2025-05-07T20:22:44.7438794Z env:
2025-05-07T20:22:44.7439009Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:44.7439326Z   BUILD_ENV: build_binary
2025-05-07T20:22:44.7439591Z   BUILD_TARGET: genai
2025-05-07T20:22:44.7439814Z   BUILD_VARIANT: cuda
2025-05-07T20:22:44.7440078Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:44.7440337Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:44.7440580Z ##[endgroup]
2025-05-07T20:22:44.8588730Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:22:44.8589896Z ##[group]Getting Git version info
2025-05-07T20:22:44.8590338Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.8590957Z [command]/usr/bin/git version
2025-05-07T20:22:44.8591229Z git version 2.47.1
2025-05-07T20:22:44.8612564Z ##[endgroup]
2025-05-07T20:22:44.8625319Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/9f623c0f-50d9-4606-af33-1a85c87373d0' before making global git config changes
2025-05-07T20:22:44.8626232Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:22:44.8639419Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.8677729Z Deleting the contents of '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.8680809Z ##[group]Initializing the repository
2025-05-07T20:22:44.8685416Z [command]/usr/bin/git init /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.8726733Z hint: Using 'master' as the name for the initial branch. This default branch name
2025-05-07T20:22:44.8727365Z hint: is subject to change. To configure the initial branch name to use in all
2025-05-07T20:22:44.8727923Z hint: of your new repositories, which will suppress this warning, call:
2025-05-07T20:22:44.8728299Z hint:
2025-05-07T20:22:44.8728598Z hint:   git config --global init.defaultBranch <name>
2025-05-07T20:22:44.8728928Z hint:
2025-05-07T20:22:44.8729258Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
2025-05-07T20:22:44.8729799Z hint: 'development'. The just-created branch can be renamed via this command:
2025-05-07T20:22:44.8730205Z hint:
2025-05-07T20:22:44.8730434Z hint:   git branch -m <name>
2025-05-07T20:22:44.8730934Z Initialized empty Git repository in /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/
2025-05-07T20:22:44.8740162Z [command]/usr/bin/git remote add origin https://github.com/pytorch/FBGEMM
2025-05-07T20:22:44.8775469Z ##[endgroup]
2025-05-07T20:22:44.8775967Z ##[group]Disabling automatic garbage collection
2025-05-07T20:22:44.8779925Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:22:44.8813573Z ##[endgroup]
2025-05-07T20:22:44.8814375Z ##[group]Setting up auth
2025-05-07T20:22:44.8819340Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:22:44.8850907Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:22:44.9225036Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:22:44.9257624Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:22:44.9600223Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:44.9648603Z ##[endgroup]
2025-05-07T20:22:44.9649010Z ##[group]Fetching the repository
2025-05-07T20:22:44.9657292Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:22:45.3966749Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:22:45.3967558Z  * [new ref]         a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:22:45.3990950Z ##[endgroup]
2025-05-07T20:22:45.3994257Z ##[group]Determining the checkout info
2025-05-07T20:22:45.3994698Z ##[endgroup]
2025-05-07T20:22:45.3998974Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:22:45.4047043Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
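For reference, the checkout above can be reproduced outside of CI. The commands below are a minimal sketch assembled from the [command] lines in this log (the masked basic-auth header is omitted, so a public, unauthenticated clone is assumed):

  git init FBGEMM && cd FBGEMM
  git remote add origin https://github.com/pytorch/FBGEMM
  # Fetch only the PR merge commit (fetch-depth: 1), exactly as actions/checkout does
  git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 \
      origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
  git checkout --progress --force refs/remotes/pull/4066/merge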
2025-05-07T20:22:45.4090367Z ##[group]Checking out the ref
2025-05-07T20:22:45.4094584Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:22:45.5187488Z Note: switching to 'refs/remotes/pull/4066/merge'.
2025-05-07T20:22:45.5187915Z
2025-05-07T20:22:45.5188239Z You are in 'detached HEAD' state. You can look around, make experimental
2025-05-07T20:22:45.5188991Z changes and commit them, and you can discard any commits you make in this
2025-05-07T20:22:45.5189718Z state without impacting any branches by switching back to a branch.
2025-05-07T20:22:45.5190160Z
2025-05-07T20:22:45.5190487Z If you want to create a new branch to retain commits you create, you may
2025-05-07T20:22:45.5191149Z do so (now or later) by using -c with the switch command. Example:
2025-05-07T20:22:45.5191523Z
2025-05-07T20:22:45.5191688Z   git switch -c <new-branch-name>
2025-05-07T20:22:45.5191957Z
2025-05-07T20:22:45.5192137Z Or undo this operation with:
2025-05-07T20:22:45.5192395Z
2025-05-07T20:22:45.5192521Z   git switch -
2025-05-07T20:22:45.5193054Z
2025-05-07T20:22:45.5193380Z Turn off this advice by setting config variable advice.detachedHead to false
2025-05-07T20:22:45.5193846Z
2025-05-07T20:22:45.5194372Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:22:45.5202236Z ##[endgroup]
2025-05-07T20:22:45.5202648Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:22:45.5208383Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:45.5259887Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:22:45.5292611Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:22:45.5324669Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:22:45.5352885Z ##[endgroup]
2025-05-07T20:22:45.5353279Z ##[group]Fetching submodules
2025-05-07T20:22:45.5356916Z [command]/usr/bin/git submodule sync
2025-05-07T20:22:45.5702380Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:22:45.6034899Z Submodule 'external/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'external/asmjit'
2025-05-07T20:22:45.6037678Z Submodule 'external/composable_kernel' (https://github.com/jwfromm/composable_kernel.git) registered for path 'external/composable_kernel'
2025-05-07T20:22:45.6041077Z Submodule 'external/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'external/cpuinfo'
2025-05-07T20:22:45.6044678Z Submodule 'external/cutlass' (https://github.com/jwfromm/cutlass) registered for path 'external/cutlass'
2025-05-07T20:22:45.6048831Z Submodule 'external/googletest' (https://github.com/google/googletest) registered for path 'external/googletest'
2025-05-07T20:22:45.6052714Z Submodule 'external/hipify_torch' (https://github.com/ROCmSoftwarePlatform/hipify_torch.git) registered for path 'external/hipify_torch'
2025-05-07T20:22:45.6056825Z Submodule 'external/json' (https://github.com/nlohmann/json.git) registered for path 'external/json'
2025-05-07T20:22:45.6088024Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/asmjit'...
2025-05-07T20:22:46.1641568Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/composable_kernel'...
2025-05-07T20:22:46.6085491Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cpuinfo'...
2025-05-07T20:22:47.0510137Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cutlass'...
2025-05-07T20:22:48.1706791Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/googletest'...
2025-05-07T20:22:48.5045348Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/hipify_torch'...
2025-05-07T20:22:48.7837553Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/json'...
2025-05-07T20:22:49.8658035Z From https://github.com/asmjit/asmjit
2025-05-07T20:22:49.8658699Z  * branch            e5d7c0bd5d9aec44d68830187138149e6a8c4e32 -> FETCH_HEAD
2025-05-07T20:22:49.9126923Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:22:51.1939768Z From https://github.com/jwfromm/composable_kernel
2025-05-07T20:22:51.1940270Z  * branch            4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 -> FETCH_HEAD
2025-05-07T20:22:51.4731636Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:22:52.1882961Z From https://github.com/pytorch/cpuinfo
2025-05-07T20:22:52.1884165Z  * branch            6543fec09b2f04ac4a666882998b534afc9c1349 -> FETCH_HEAD
2025-05-07T20:22:52.2988420Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:22:53.4650700Z From https://github.com/jwfromm/cutlass
2025-05-07T20:22:53.4651161Z  * branch            3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 -> FETCH_HEAD
2025-05-07T20:22:54.1616943Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:22:54.9598506Z From https://github.com/google/googletest
2025-05-07T20:22:54.9598960Z  * branch            f8d7d77c06936315286eb55f8de22cd23c188571 -> FETCH_HEAD
2025-05-07T20:22:54.9999188Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:22:55.6035397Z From https://github.com/ROCmSoftwarePlatform/hipify_torch
2025-05-07T20:22:55.6035885Z  * branch            420084499c7c1e1c2d801922f40df202eac5f3a0 -> FETCH_HEAD
2025-05-07T20:22:55.6120516Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:22:56.3444387Z From https://github.com/nlohmann/json
2025-05-07T20:22:56.3444992Z  * branch            9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 -> FETCH_HEAD
2025-05-07T20:22:56.4580367Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:22:56.4598651Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:22:56.4933642Z Entering 'external/asmjit'
2025-05-07T20:22:56.4965482Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.4997047Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.5028362Z Entering 'external/cutlass'
2025-05-07T20:22:56.5060683Z Entering 'external/googletest'
2025-05-07T20:22:56.5091684Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.5125205Z Entering 'external/json'
2025-05-07T20:22:56.5169657Z ##[endgroup]
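The --depth=1 submodule update above is why each clone is followed by a '* branch <sha> -> FETCH_HEAD' line: every submodule is pinned by fetching its recorded commit directly instead of cloning full history. A minimal sketch of the equivalent per-submodule operation, using the asmjit pin from this log:

  cd external/asmjit
  # Fetch just the pinned commit; this is what prints "* branch <sha> -> FETCH_HEAD"
  git fetch --depth=1 origin e5d7c0bd5d9aec44d68830187138149e6a8c4e32
  git checkout --detach e5d7c0bd5d9aec44d68830187138149e6a8c4e32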
2025-05-07T20:22:56.5170067Z ##[group]Persisting credentials for submodules
2025-05-07T20:22:56.5176568Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:22:56.5506477Z Entering 'external/asmjit'
2025-05-07T20:22:56.5572569Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.5644558Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.5711835Z Entering 'external/cutlass'
2025-05-07T20:22:56.5786872Z Entering 'external/googletest'
2025-05-07T20:22:56.5852612Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.5918727Z Entering 'external/json'
2025-05-07T20:22:56.6003561Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:22:56.6336826Z Entering 'external/asmjit'
2025-05-07T20:22:56.6400731Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:22:56.6403006Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.6463530Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:22:56.6466490Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.6526661Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:22:56.6529626Z Entering 'external/cutlass'
2025-05-07T20:22:56.6590423Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:22:56.6593376Z Entering 'external/googletest'
2025-05-07T20:22:56.6653891Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:22:56.6657256Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.6718203Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:22:56.6721107Z Entering 'external/json'
2025-05-07T20:22:56.6784015Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:22:56.6890434Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:22:56.7221487Z Entering 'external/asmjit'
2025-05-07T20:22:56.7253983Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.7285308Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.7317315Z Entering 'external/cutlass'
2025-05-07T20:22:56.7349185Z Entering 'external/googletest'
2025-05-07T20:22:56.7381787Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.7414142Z Entering 'external/json'
2025-05-07T20:22:56.7463841Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:22:56.7791454Z Entering 'external/asmjit'
2025-05-07T20:22:56.7825965Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.7859032Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.7890877Z Entering 'external/cutlass'
2025-05-07T20:22:56.7944295Z Entering 'external/googletest'
2025-05-07T20:22:56.7958745Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.7990022Z Entering 'external/json'
2025-05-07T20:22:56.8033900Z ##[endgroup]
2025-05-07T20:22:56.8075900Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:22:56.8102895Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
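Credential persistence above is plain git config: the masked Authorization header is written into each submodule's local config, and two insteadOf rules rewrite SSH-style URLs to HTTPS so the header applies to every fetch. A sketch of the effective per-submodule configuration (values copied from the log; the token stays masked):

  git config --local http.https://github.com/.extraheader 'AUTHORIZATION: basic ***'
  git config --local --add url.https://github.com/.insteadOf 'git@github.com:'
  git config --local --add url.https://github.com/.insteadOf 'org-21003710@github.com:'
  # Net effect: git@github.com:owner/repo is fetched as https://github.com/owner/repo,
  # and every HTTPS request to github.com carries the injected header.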
2025-05-07T20:22:56.8277994Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:22:56.8278322Z with:
2025-05-07T20:22:56.8278569Z   name: fbgemm_genai_x86_clang_py3.10_cu12.6.3.whl
2025-05-07T20:22:56.8278890Z   merge-multiple: false
2025-05-07T20:22:56.8279153Z   repository: pytorch/FBGEMM
2025-05-07T20:22:56.8279415Z   run-id: 14891846252
2025-05-07T20:22:56.8279625Z env:
2025-05-07T20:22:56.8279850Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.8280151Z   BUILD_ENV: build_binary
2025-05-07T20:22:56.8280429Z   BUILD_TARGET: genai
2025-05-07T20:22:56.8280681Z   BUILD_VARIANT: cuda
2025-05-07T20:22:56.8280925Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:56.8281180Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.8281424Z ##[endgroup]
2025-05-07T20:22:57.0613770Z Downloading single artifact
2025-05-07T20:22:57.1601487Z Preparing to download the following artifacts:
2025-05-07T20:22:57.1602396Z - fbgemm_genai_x86_clang_py3.10_cu12.6.3.whl (ID: 3081363083, Size: 12540944, Expected Digest: sha256:afbb98e930da7c62e149bc1ea88813f21873c24e8bb8269009e6340258c9d98e)
2025-05-07T20:22:57.2129883Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-0a2daaca-7a55-5fcf-bcc5-f66fdbd32d30/artifacts/648fc1a1b73d5d5cd1d464169b896b3a80c98aae0ebb5ca5326862fe4d644842.zip
2025-05-07T20:22:57.2131368Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:57.3075918Z (node:56910) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:22:57.3076889Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:22:57.4982128Z SHA256 digest of downloaded artifact is afbb98e930da7c62e149bc1ea88813f21873c24e8bb8269009e6340258c9d98e
2025-05-07T20:22:57.4982721Z Artifact download completed successfully.
2025-05-07T20:22:57.4983059Z Total of 1 artifact(s) downloaded
2025-05-07T20:22:57.4988613Z Download artifact has finished successfully
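download-artifact verified the downloaded zip against the expected SHA256 digest before reporting success. The same check can be done by hand; a minimal sketch (the local archive name is hypothetical):

  echo 'afbb98e930da7c62e149bc1ea88813f21873c24e8bb8269009e6340258c9d98e  artifact.zip' | sha256sum --check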
2025-05-07T20:22:57.5248679Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:22:57.5249067Z with:
2025-05-07T20:22:57.5249284Z   driver-version: 570.133.07
2025-05-07T20:22:57.5249532Z env:
2025-05-07T20:22:57.5249750Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:57.5250055Z   BUILD_ENV: build_binary
2025-05-07T20:22:57.5250312Z   BUILD_TARGET: genai
2025-05-07T20:22:57.5250545Z   BUILD_VARIANT: cuda
2025-05-07T20:22:57.5250796Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:57.5251064Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:57.5251306Z ##[endgroup]
2025-05-07T20:22:57.5351162Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:22:57.5351547Z with:
2025-05-07T20:22:57.5351925Z   timeout_minutes: 10
2025-05-07T20:22:57.5352157Z   max_attempts: 3
2025-05-07T20:22:57.5376116Z   command: # Is it disgusting to have a full shell script here in this github action? Sure
    # But is it the best way to make it so that this action relies on nothing else? Absolutely
    set -eou pipefail

    DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
    DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

    install_nvidia_docker2_amzn2() {
      (
        set -x
        # Needed for yum-config-manager
        sudo yum install -y yum-utils
        if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
          YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
        else
          # Amazon Linux 2
          YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
        fi
        sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
        sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      )
    }

    install_nvidia_docker2_ubuntu20() {
      (
        set -x
        # Install nvidia-driver package if not installed
        status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
        if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
          sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
          sudo systemctl restart docker
        fi
      )
    }

    pre_install_nvidia_driver_amzn2() {
      (
        # Purge any nvidia driver installed from RHEL repo
        sudo yum remove -y nvidia-driver-latest-dkms
      )
    }

    install_nvidia_driver_common() {
      (
        # Try to gather more information about the runner and its existing NVIDIA driver if any
        echo "Before installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true

        HAS_NVIDIA_DRIVER=0
        # Check if NVIDIA driver has already been installed
        if [ -x "$(command -v nvidia-smi)" ]; then
          set +e
          # The driver exists, check its version next. Also check only the first GPU if there are more than one of them
          # so that the same driver version is not print over multiple lines
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
          elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
            # Turn off persistent mode so that the installation script can unload the kernel module
            sudo killall nvidia-persistenced || true
          else
            HAS_NVIDIA_DRIVER=1
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
          fi
          set -e
        fi

        if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
          # CAUTION: this may need to be updated in future
          if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
            sudo yum groupinstall -y "Development Tools"
            # ensure our kernel install is the same as our underlying kernel,
            # groupinstall "Development Tools" has a habit of mismatching kernel headers
            sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
            sudo modprobe backlight
          fi
          sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

          set +e
          sudo /bin/bash /tmp/nvidia_driver -s --no-drm
          NVIDIA_INSTALLATION_STATUS=$?

          RESET_GPU=0
          if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
            sudo cat /var/log/nvidia-installer.log
            # Fail to install NVIDIA driver, try to reset the GPU
            RESET_GPU=1
          elif [ -x "$(command -v nvidia-smi)" ]; then
            # Check again if nvidia-smi works even if the driver installation completes successfully
            INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
            NVIDIA_SMI_STATUS=$?
            if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
              RESET_GPU=1
            fi
          fi

          if [ "$RESET_GPU" -eq 1 ]; then
            NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
            # The GPU can get stuck in a failure state if somehow the test crashs the GPU microcode. When this
            # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
            for PCI_ID in $NVIDIA_DEVICES; do
              DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
              echo "Reseting $PCI_ID (enabled state: $DEVICE_ENABLED)"
              # This requires sudo permission of course
              echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
              sleep 1
            done
          fi

          sudo rm -fv /tmp/nvidia_driver
          set -e
        fi
      )
    }

    post_install_nvidia_driver_common() {
      (
        sudo modprobe nvidia || true
        echo "After installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true
        (
          set +e
          nvidia-smi
          # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in
          # the case where the driver has already crashed as it still can get the driver version
          # and some basic information like the bus ID. However, the rest of the information
          # would be missing (ERR!), for example:
          #
          # +-----------------------------------------------------------------------------+
          # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
          # |-------------------------------+----------------------+----------------------+
          # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
          # |                               |                      |               MIG M. |
          # |===============================+======================+======================|
          # |   0  ERR!                Off  | 00000000:00:1E.0 Off |                 ERR! |
          # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |    ERR!      Default |
          # |                               |                      |                 ERR! |
          # +-------------------------------+----------------------+----------------------+
          #
          # +-----------------------------------------------------------------------------+
          # | Processes:                                                                  |
          # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
          # |        ID   ID                                                   Usage      |
          # |=============================================================================|
          # +-----------------------------------------------------------------------------+
          #
          # This should be reported as a failure instead as it will guarantee to fail when
          # Docker tries to run with --gpus all
          #
          # So, the correct check here is to query one of the missing piece of info like
          # GPU name, so that the command can fail accordingly
          nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
          NVIDIA_SMI_STATUS=$?
          # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
          if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
            echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
          else
            echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
            exit ${NVIDIA_SMI_STATUS}
          fi
          set -e
        )
      )
    }

    install_nvidia_driver_amzn2() {
      (
        set -x
        pre_install_nvidia_driver_amzn2
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    install_nvidia_driver_ubuntu20() {
      (
        set -x
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    echo "== Installing nvidia driver ${DRIVER_FN} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_driver_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_driver_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    # Install container toolkit based on distribution
    echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_docker2_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_docker2_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

    # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
    # more than one GPUs. This just needs to be run once. The command fails
    # on subsequent runs and complains that the mode is already on, but that's
    # ok
    sudo nvidia-persistenced || true
    # This should show persistence mode ON
    nvidia-smi
2025-05-07T20:22:57.5399903Z   retry_wait_seconds: 10
2025-05-07T20:22:57.5400165Z   polling_interval_seconds: 1
2025-05-07T20:22:57.5400426Z   warning_on_retry: true
2025-05-07T20:22:57.5400673Z   continue_on_error: false
2025-05-07T20:22:57.5400914Z env:
2025-05-07T20:22:57.5401152Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:57.5401486Z   BUILD_ENV: build_binary
2025-05-07T20:22:57.5401731Z   BUILD_TARGET: genai
2025-05-07T20:22:57.5401954Z   BUILD_VARIANT: cuda
2025-05-07T20:22:57.5402195Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:57.5402454Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:57.5402692Z   DRIVER_VERSION: 570.133.07
2025-05-07T20:22:57.5402935Z ##[endgroup]
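The core of the script above is its driver health check: nvidia-smi exit statuses 0 and 14 are treated as healthy (per the linked gpu-operator issue), and the installed driver version must match DRIVER_VERSION exactly. A condensed sketch of that logic, with the same checks in fewer branches:

  DRIVER_VERSION=570.133.07
  if command -v nvidia-smi >/dev/null; then
    INSTALLED=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
    STATUS=$?
    # 0 and 14 are the only allowed nvidia-smi statuses; querying a concrete field
    # (driver_version, gpu_name) also catches the ERR! case where plain nvidia-smi
    # still exits 0 on a crashed driver.
    if { [ "$STATUS" -eq 0 ] || [ "$STATUS" -eq 14 ]; } && [ "$INSTALLED" = "$DRIVER_VERSION" ]; then
      echo "Driver $INSTALLED already installed; skipping reinstall"
    fi
  fi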
2025-05-07T20:22:57.6207303Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:22:57.6209261Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:22:57.6209683Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:22:58.2346419Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:22:58.2347157Z No packages marked for removal.
2025-05-07T20:22:58.2411526Z Dependencies resolved.
2025-05-07T20:22:58.2422195Z Nothing to do.
2025-05-07T20:22:58.2423149Z Complete!
2025-05-07T20:22:58.2744408Z + install_nvidia_driver_common
2025-05-07T20:22:58.2750599Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:22:58.2750951Z + lspci
2025-05-07T20:22:58.2751915Z Before installing NVIDIA driver
2025-05-07T20:22:58.2936027Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:58.2936874Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:58.2937467Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:58.2938359Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:58.2939137Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:58.2939723Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:58.2940220Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:58.2940710Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:58.2941159Z + lsmod
2025-05-07T20:22:58.2981536Z Module                  Size  Used by
2025-05-07T20:22:58.2982182Z xt_conntrack           16384  1
2025-05-07T20:22:58.2982705Z nft_chain_nat          16384  3
2025-05-07T20:22:58.2983215Z xt_MASQUERADE          20480  1
2025-05-07T20:22:58.2983828Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:58.2984497Z nf_conntrack_netlink   57344  0
2025-05-07T20:22:58.2985290Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:58.2986168Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:22:58.2986798Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:22:58.2987389Z xfrm_user              57344  1
2025-05-07T20:22:58.2987908Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:22:58.2988487Z xt_addrtype            16384  2
2025-05-07T20:22:58.2989007Z nft_compat             20480  4
2025-05-07T20:22:58.2989603Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:22:58.2990438Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:58.2991125Z br_netfilter           36864  0
2025-05-07T20:22:58.2991448Z bridge                323584  1 br_netfilter
2025-05-07T20:22:58.2991756Z stp                    16384  1 bridge
2025-05-07T20:22:58.2992052Z llc                    16384  2 bridge,stp
2025-05-07T20:22:58.2992343Z overlay               167936  0
2025-05-07T20:22:58.2992588Z tls                   135168  0
2025-05-07T20:22:58.2992849Z nls_ascii              16384  1
2025-05-07T20:22:58.2993104Z nls_cp437              20480  1
2025-05-07T20:22:58.2993346Z vfat                   24576  1
2025-05-07T20:22:58.2993604Z fat                    86016  1 vfat
2025-05-07T20:22:58.2993878Z sunrpc                696320  1
2025-05-07T20:22:58.2994125Z i8042                  45056  0
2025-05-07T20:22:58.2994381Z serio                  28672  3 i8042
2025-05-07T20:22:58.2994657Z ena                   180224  0
2025-05-07T20:22:58.2994910Z ghash_clmulni_intel    16384  0
2025-05-07T20:22:58.2995187Z button                 24576  0
2025-05-07T20:22:58.2995446Z sch_fq_codel           20480  17
2025-05-07T20:22:58.2995703Z dm_mod                188416  0
2025-05-07T20:22:58.2995956Z fuse                  163840  1
2025-05-07T20:22:58.2996213Z loop                   36864  0
2025-05-07T20:22:58.2996478Z dax                    45056  1 dm_mod
2025-05-07T20:22:58.2996752Z configfs               57344  1
2025-05-07T20:22:58.2997018Z dmi_sysfs              20480  0
2025-05-07T20:22:58.2997278Z crc32_pclmul           16384  0
2025-05-07T20:22:58.2997530Z crc32c_intel           24576  0
2025-05-07T20:22:58.2997793Z efivarfs               24576  1
2025-05-07T20:22:58.2998048Z + modinfo nvidia
2025-05-07T20:22:58.3000056Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:58.3000569Z import_ns:      DMA_BUF
2025-05-07T20:22:58.3000827Z alias:          char-major-195-*
2025-05-07T20:22:58.3001155Z version:        570.133.07
2025-05-07T20:22:58.3001404Z supported:      external
2025-05-07T20:22:58.3001666Z license:        Dual MIT/GPL
2025-05-07T20:22:58.3001959Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:58.3002415Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:58.3002923Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:22:58.3003256Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:58.3003602Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:58.3003935Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:58.3004253Z depends:        i2c-core,drm
2025-05-07T20:22:58.3004512Z retpoline:      Y
2025-05-07T20:22:58.3004740Z name:           nvidia
2025-05-07T20:22:58.3005098Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:58.3005595Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:58.3006084Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:58.3006866Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:22:58.3007362Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:22:58.3007838Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:58.3008351Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:22:58.3008770Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:22:58.3009087Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:22:58.3009463Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:58.3010032Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:22:58.3010565Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:22:58.3011071Z parm:           NVreg_EnableMSI:int
2025-05-07T20:22:58.3011485Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:58.3011866Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:58.3012270Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:58.3012652Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:58.3013085Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:58.3013501Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:22:58.3013936Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:58.3014349Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:22:58.3014694Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:58.3015074Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:58.3015449Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:58.3015798Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:22:58.3016130Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:58.3016463Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:58.3016795Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:58.3017116Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:22:58.3017463Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:58.3017847Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:22:58.3018281Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:22:58.3018631Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:58.3018983Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:58.3019328Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:22:58.3019679Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:58.3020012Z parm:           NVreg_RmMsg:charp
2025-05-07T20:22:58.3020311Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:22:58.3020645Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:22:58.3020973Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:22:58.3021296Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:58.3021638Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:58.3022004Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:58.3022358Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:22:58.3022700Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:22:58.3023061Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:58.3023404Z parm:           rm_firmware_active:charp
2025-05-07T20:22:58.3023851Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:22:58.3024107Z ++ command -v nvidia-smi
2025-05-07T20:22:58.3024370Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:22:58.3024634Z + set +e
2025-05-07T20:22:58.3024951Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:23:00.1163815Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:23:00.1164194Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:00.1164435Z + '[' 0 -ne 0 ']'
2025-05-07T20:23:00.1164667Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:23:00.1164945Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:23:00.1165381Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:23:00.1165865Z + set -e
2025-05-07T20:23:00.1166420Z + '[' 1 -eq 0 ']'
2025-05-07T20:23:00.1166824Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:23:00.1167289Z + post_install_nvidia_driver_common
2025-05-07T20:23:00.1170162Z + sudo modprobe nvidia
2025-05-07T20:23:00.2451983Z + echo 'After installing NVIDIA driver'
2025-05-07T20:23:00.2452317Z + lspci
2025-05-07T20:23:00.2452540Z After installing NVIDIA driver
2025-05-07T20:23:00.2572619Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:00.2573133Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:00.2573698Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:00.2574422Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:00.2575101Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:00.2575636Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:00.2576153Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:00.2576633Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:00.2577054Z + lsmod
2025-05-07T20:23:00.2604848Z Module                  Size  Used by
2025-05-07T20:23:00.2605163Z nvidia_uvm           1884160  0
2025-05-07T20:23:00.2605431Z nvidia              11583488  1 nvidia_uvm
2025-05-07T20:23:00.2605724Z drm                   602112  1 nvidia
2025-05-07T20:23:00.2606039Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:23:00.2606345Z backlight              24576  1 drm
2025-05-07T20:23:00.2606633Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:23:00.2606926Z xt_conntrack           16384  1
2025-05-07T20:23:00.2607191Z nft_chain_nat          16384  3
2025-05-07T20:23:00.2607447Z xt_MASQUERADE          20480  1
2025-05-07T20:23:00.2607751Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:00.2608093Z nf_conntrack_netlink   57344  0
2025-05-07T20:23:00.2608491Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:00.2608930Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:23:00.2609252Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:23:00.2609546Z xfrm_user              57344  1
2025-05-07T20:23:00.2609815Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:23:00.2610106Z xt_addrtype            16384  2
2025-05-07T20:23:00.2610362Z nft_compat             20480  4
2025-05-07T20:23:00.2610674Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:00.2611082Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:00.2611455Z br_netfilter           36864  0
2025-05-07T20:23:00.2611729Z bridge                323584  1 br_netfilter
2025-05-07T20:23:00.2612024Z stp                    16384  1 bridge
2025-05-07T20:23:00.2612300Z llc                    16384  2 bridge,stp
2025-05-07T20:23:00.2612585Z overlay               167936  0
2025-05-07T20:23:00.2612835Z tls                   135168  0
2025-05-07T20:23:00.2613079Z nls_ascii              16384  1
2025-05-07T20:23:00.2613655Z nls_cp437              20480  1
2025-05-07T20:23:00.2613907Z vfat                   24576  1
2025-05-07T20:23:00.2614153Z fat                    86016  1 vfat
2025-05-07T20:23:00.2614422Z sunrpc                696320  1
2025-05-07T20:23:00.2614668Z i8042                  45056  0
2025-05-07T20:23:00.2614927Z serio                  28672  3 i8042
2025-05-07T20:23:00.2615196Z ena                   180224  0
2025-05-07T20:23:00.2615450Z ghash_clmulni_intel    16384  0
2025-05-07T20:23:00.2615704Z button                 24576  0
2025-05-07T20:23:00.2615949Z sch_fq_codel           20480  17
2025-05-07T20:23:00.2616210Z dm_mod                188416  0
2025-05-07T20:23:00.2616457Z fuse                  163840  1
2025-05-07T20:23:00.2616700Z loop                   36864  0
2025-05-07T20:23:00.2617105Z dax                    45056  1 dm_mod
2025-05-07T20:23:00.2617379Z configfs               57344  1
2025-05-07T20:23:00.2617625Z dmi_sysfs              20480  0
2025-05-07T20:23:00.2617880Z crc32_pclmul           16384  0
2025-05-07T20:23:00.2618275Z crc32c_intel           24576  0
2025-05-07T20:23:00.2618551Z efivarfs               24576  1
2025-05-07T20:23:00.2618801Z + modinfo nvidia
2025-05-07T20:23:00.2623644Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:00.2624110Z import_ns:      DMA_BUF
2025-05-07T20:23:00.2624362Z alias:          char-major-195-*
2025-05-07T20:23:00.2624637Z version:        570.133.07
2025-05-07T20:23:00.2624888Z supported:      external
2025-05-07T20:23:00.2625135Z license:        Dual MIT/GPL
2025-05-07T20:23:00.2625424Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:00.2625764Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:00.2626076Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:23:00.2626405Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:00.2626748Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:00.2627080Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:00.2627388Z depends:        i2c-core,drm
2025-05-07T20:23:00.2627647Z retpoline:      Y
2025-05-07T20:23:00.2627876Z name:           nvidia
2025-05-07T20:23:00.2628230Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:00.2628698Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:00.2629143Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:00.2629550Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:23:00.2629862Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:23:00.2630161Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:00.2630475Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:23:00.2630773Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:23:00.2631085Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:23:00.2631449Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:00.2631832Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:23:00.2632169Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:23:00.2632473Z parm:           NVreg_EnableMSI:int
2025-05-07T20:23:00.2632774Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:00.2633136Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:00.2633531Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:00.2633907Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:00.2634314Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:00.2634726Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:23:00.2635143Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:00.2635550Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:23:00.2635887Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:00.2636258Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:00.2636737Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:00.2637084Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:23:00.2637406Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:00.2637736Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:00.2638055Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:00.2638367Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:23:00.2638716Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:00.2639071Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:23:00.2639401Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:23:00.2639741Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:00.2640085Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:00.2640512Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:23:00.2640859Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:00.2641186Z parm:           NVreg_RmMsg:charp
2025-05-07T20:23:00.2641485Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:23:00.2641811Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:23:00.2642137Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:23:00.2642449Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:00.2642782Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:00.2643141Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:00.2643484Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:23:00.2643815Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:23:00.2644164Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:00.2644496Z parm:           rm_firmware_active:charp
2025-05-07T20:23:00.2644782Z + set +e
2025-05-07T20:23:00.2644982Z + nvidia-smi
2025-05-07T20:23:01.6714108Z Wed May  7 20:23:01 2025
2025-05-07T20:23:01.6714526Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.6715059Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:01.6715552Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:01.6716044Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:01.6716587Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:01.6717028Z |                                         |                        |               MIG M. |
2025-05-07T20:23:01.6717363Z |=========================================+========================+======================|
2025-05-07T20:23:01.6779830Z |   0  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:01.6780294Z |  0%   29C    P0             63W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:01.6780683Z |                                         |                        |                  N/A |
2025-05-07T20:23:01.6781074Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:01.6781470Z
2025-05-07T20:23:01.6781859Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.6782284Z | Processes:                                                                              |
2025-05-07T20:23:01.6782728Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:01.6783138Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:01.6783484Z |=========================================================================================|
2025-05-07T20:23:01.6784562Z |  No running processes found                                                             |
2025-05-07T20:23:01.6785315Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:02.0987070Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:03.5045692Z NVIDIA A10G
2025-05-07T20:23:03.7751961Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:03.7752250Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:03.7752502Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:03.7752792Z + set -e
2025-05-07T20:23:03.7753014Z INFO: Ignoring allowed status 0
2025-05-07T20:23:03.7763252Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:03.7765213Z + sudo yum install -y yum-utils
2025-05-07T20:23:04.2289004Z Last metadata expiration check: 0:05:26 ago on Wed May  7 20:17:38 2025.
2025-05-07T20:23:04.2534948Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:04.2929478Z Dependencies resolved.
2025-05-07T20:23:04.3109820Z Nothing to do.
2025-05-07T20:23:04.3110599Z Complete!
2025-05-07T20:23:04.3482675Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:04.3483301Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:04.3484167Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:04.6353867Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:04.6909914Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:05.2057510Z nvidia-container-toolkit                         14 kB/s | 833  B     00:00
2025-05-07T20:23:05.2304388Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:05.2704512Z Dependencies resolved.
2025-05-07T20:23:05.2884350Z ================================================================================
2025-05-07T20:23:05.2884979Z  Package                        Arch    Version   Repository               Size
2025-05-07T20:23:05.2885386Z ================================================================================
2025-05-07T20:23:05.2885701Z Downgrading:
2025-05-07T20:23:05.2886078Z  nvidia-container-toolkit       x86_64  1.16.2-1  nvidia-container-toolkit 1.2 M
2025-05-07T20:23:05.2886686Z  nvidia-container-toolkit-base  x86_64  1.16.2-1  nvidia-container-toolkit 5.6 M
2025-05-07T20:23:05.2887043Z
2025-05-07T20:23:05.2887136Z Transaction Summary
2025-05-07T20:23:05.2887394Z ================================================================================
2025-05-07T20:23:05.2887713Z Downgrade  2 Packages
2025-05-07T20:23:05.2887864Z
2025-05-07T20:23:05.2887976Z Total download size: 6.8 M
2025-05-07T20:23:05.2888598Z Downloading Packages:
2025-05-07T20:23:05.3605172Z (1/2): nvidia-container-toolkit-base-1.16.2-1.x  80 MB/s | 5.6 MB     00:00
2025-05-07T20:23:05.4116222Z (2/2): nvidia-container-toolkit-1.16.2-1.x86_64  10 MB/s | 1.2 MB     00:00
2025-05-07T20:23:05.4128836Z --------------------------------------------------------------------------------
2025-05-07T20:23:05.4132191Z Total                                            55 MB/s | 6.8 MB     00:00
2025-05-07T20:23:05.4134836Z Running transaction check
2025-05-07T20:23:05.4236679Z Transaction check succeeded.
2025-05-07T20:23:05.4237117Z Running transaction test
2025-05-07T20:23:05.4530111Z Transaction test succeeded.
2025-05-07T20:23:05.4532578Z Running transaction
2025-05-07T20:23:05.9986793Z   Preparing        :                                                        1/1
2025-05-07T20:23:06.1029794Z   Downgrading      : nvidia-container-toolkit-base-1.16.2-1.x86_64          1/4
2025-05-07T20:23:06.1050882Z   Downgrading      : nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:06.1251512Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:06.1252100Z   Cleanup          : nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:06.1350911Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:06.1373507Z   Cleanup          : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
2025-05-07T20:23:07.5374801Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               4/4
2025-05-07T20:23:07.5375421Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64               1/4
2025-05-07T20:23:07.5375958Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64               2/4
2025-05-07T20:23:07.5376487Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64          3/4
2025-05-07T20:23:07.6781679Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
2025-05-07T20:23:07.6781679Z ================================================================================
2025-05-07T20:23:07.6783191Z WARNING:
2025-05-07T20:23:07.6783488Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:07.6783726Z
2025-05-07T20:23:07.6783823Z   Available Versions:
2025-05-07T20:23:07.6783980Z
2025-05-07T20:23:07.6784086Z   Version 2023.7.20250331:
2025-05-07T20:23:07.6784408Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:07.6784663Z
2025-05-07T20:23:07.6784792Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:07.6785014Z
2025-05-07T20:23:07.6785102Z     Release notes:
2025-05-07T20:23:07.6785520Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:07.6785897Z
2025-05-07T20:23:07.6786247Z   Version 2023.7.20250414:
2025-05-07T20:23:07.6786622Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:07.6786973Z
2025-05-07T20:23:07.6787124Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:07.6787448Z
2025-05-07T20:23:07.6787580Z     Release notes:
2025-05-07T20:23:07.6788104Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:07.6788500Z
2025-05-07T20:23:07.6788620Z   Version 2023.7.20250428:
2025-05-07T20:23:07.6802053Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:07.6802321Z
2025-05-07T20:23:07.6802449Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:07.6802666Z
2025-05-07T20:23:07.6802756Z     Release notes:
2025-05-07T20:23:07.6803218Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:07.6803586Z
2025-05-07T20:23:07.6803708Z ================================================================================
2025-05-07T20:23:07.7145335Z
2025-05-07T20:23:07.7145482Z
2025-05-07T20:23:07.7145761Z Downgraded:
2025-05-07T20:23:07.7146135Z   nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:07.7146713Z   nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:07.7147073Z
2025-05-07T20:23:07.7147158Z Complete!
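Note that yum resolved the version-suffixed nvidia-container-toolkit-1.16.2 against the preinstalled 1.17.6 and performed a downgrade rather than a fresh install; pinning by package-name-with-version is all the script does. A sketch, with an optional versionlock (this assumes the dnf versionlock plugin is available, which this log does not show):

  sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
  # Optionally freeze the version so a later 'yum update' does not undo the downgrade:
  # sudo dnf versionlock add nvidia-container-toolkit-1.16.2-1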
2025-05-07T20:23:07.7615078Z + sudo systemctl restart docker
2025-05-07T20:23:11.7047071Z Wed May  7 20:23:11 2025
2025-05-07T20:23:11.7047500Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:11.7048000Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:11.7048493Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:11.7048992Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:11.7049515Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:11.7049947Z |                                         |                        |               MIG M. |
2025-05-07T20:23:11.7050282Z |=========================================+========================+======================|
2025-05-07T20:23:11.7131700Z |   0  NVIDIA A10G                     On |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:11.7132642Z |  0%   29C    P0             63W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:11.7133047Z |                                         |                        |                  N/A |
2025-05-07T20:23:11.7133439Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:11.7133834Z
2025-05-07T20:23:11.7134234Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:11.7134697Z | Processes:                                                                              |
2025-05-07T20:23:11.7135139Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:11.7135710Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:11.7136048Z |=========================================================================================|
2025-05-07T20:23:11.7137560Z |  No running processes found                                                             |
2025-05-07T20:23:11.7138029Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:12.5995466Z Command completed after 1 attempt(s).
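The GPU_FLAG that appears in the env block of the next step was exported by the setup script through the GITHUB_ENV file, which is how one step passes environment variables to later steps in the same job. A minimal sketch (the docker invocation is a hypothetical consumer, not from this log):

  # In the setup step:
  echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"
  # In any later step of the same job, e.g. when launching a test container:
  docker run --rm ${GPU_FLAG} ubuntu:22.04 nvidia-smi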
2025-05-07T20:23:12.6082808Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:12.6083298Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:12.6097009Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:12.6097359Z env:
2025-05-07T20:23:12.6097599Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:12.6097903Z   BUILD_ENV: build_binary
2025-05-07T20:23:12.6098261Z   BUILD_TARGET: genai
2025-05-07T20:23:12.6098510Z   BUILD_VARIANT: cuda
2025-05-07T20:23:12.6098746Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:12.6099014Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:12.6099322Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:12.6099652Z ##[endgroup]
2025-05-07T20:23:12.9478360Z ################################################################################
2025-05-07T20:23:12.9478839Z # Print System Info
2025-05-07T20:23:12.9479159Z #
2025-05-07T20:23:12.9494219Z # [2025-05-07T20:23:12.949Z] + print_system_info
2025-05-07T20:23:12.9494730Z ################################################################################
2025-05-07T20:23:12.9495054Z
2025-05-07T20:23:12.9495215Z ################################################################################
2025-05-07T20:23:12.9495699Z [INFO] Printing environment variables ...
2025-05-07T20:23:12.9496115Z + printenv
2025-05-07T20:23:12.9496293Z
2025-05-07T20:23:12.9519336Z SHELL=/bin/bash
2025-05-07T20:23:12.9519726Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:12.9520257Z BUILD_VARIANT=cuda
2025-05-07T20:23:12.9521031Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_9737db9c-fa2f-4aa5-8f0b-5d1fd405ba6f
2025-05-07T20:23:12.9521888Z GITHUB_ACTION=__run
2025-05-07T20:23:12.9522326Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:12.9522840Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:12.9523210Z RUNNER_NAME=i-06f3d8044a6f79407
2025-05-07T20:23:12.9523611Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:12.9524083Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:12.9524517Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:12.9525048Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:12.9525664Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:12.9525957Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:12.9526252Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:12.9526754Z ***
2025-05-07T20:23:12.9526962Z LOGNAME=ec2-user
2025-05-07T20:23:12.9527199Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:12.9527479Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:12.9527718Z GITHUB_ACTIONS=true
2025-05-07T20:23:12.9527939Z SYSTEMD_EXEC_PID=55476
2025-05-07T20:23:12.9528223Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:12.9528771Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:12.9529277Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:12.9529564Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:12.9529828Z RUNNER_OS=Linux
2025-05-07T20:23:12.9530051Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:12.9530303Z HOME=/home/ec2-user
2025-05-07T20:23:12.9530557Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:12.9530846Z LANG=C.UTF-8
2025-05-07T20:23:12.9531152Z RUNNER_TRACKING_ID=github_b457ea54-0b6b-45b3-bdbc-45cac5aef1d8
2025-05-07T20:23:12.9531518Z RUNNER_ARCH=X64
2025-05-07T20:23:12.9531804Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:12.9532375Z BUILD_TARGET=genai
2025-05-07T20:23:12.9532917Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_9737db9c-fa2f-4aa5-8f0b-5d1fd405ba6f
2025-05-07T20:23:12.9533790Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_9737db9c-fa2f-4aa5-8f0b-5d1fd405ba6f
2025-05-07T20:23:12.9534531Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:12.9535209Z INVOCATION_ID=1482d53b51c24cadbdb69d1e5516bd3d
2025-05-07T20:23:12.9535544Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:12.9535814Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:12.9536397Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_9737db9c-fa2f-4aa5-8f0b-5d1fd405ba6f
2025-05-07T20:23:12.9537015Z BUILD_ENV=build_binary
2025-05-07T20:23:12.9537255Z GITHUB_ACTOR=q10
2025-05-07T20:23:12.9537473Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:12.9537704Z KERN_NAME_LC=linux
2025-05-07T20:23:12.9537941Z BUILD_CUDA_VERSION=12.6.3
2025-05-07T20:23:12.9538353Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:12.9538701Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:12.9538953Z USER=ec2-user
2025-05-07T20:23:12.9539191Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:12.9539478Z SHLVL=1
2025-05-07T20:23:12.9539684Z GITHUB_ACTOR_ID=255046
2025-05-07T20:23:12.9539997Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool
2025-05-07T20:23:12.9540445Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e
2025-05-07T20:23:12.9540813Z GITHUB_REF_NAME=4066/merge
2025-05-07T20:23:12.9541061Z KERN_NAME=Linux
2025-05-07T20:23:12.9541292Z GITHUB_JOB=test_and_publish_artifact
2025-05-07T20:23:12.9541707Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh
2025-05-07T20:23:12.9542140Z GITHUB_REPOSITORY=pytorch/FBGEMM
2025-05-07T20:23:12.9542460Z GITHUB_RETENTION_DAYS=90
2025-05-07T20:23:12.9542813Z JOURNAL_STREAM=8:90275
2025-05-07T20:23:12.9543289Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM
2025-05-07T20:23:12.9543820Z GITHUB_ACTION_REPOSITORY=
2025-05-07T20:23:12.9544291Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
2025-05-07T20:23:12.9544787Z GITHUB_BASE_REF=main
2025-05-07T20:23:12.9545098Z CI=true
2025-05-07T20:23:12.9545393Z GITHUB_REPOSITORY_OWNER=pytorch
2025-05-07T20:23:12.9545774Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6
2025-05-07T20:23:12.9546160Z GITHUB_ACTION_REF=
2025-05-07T20:23:12.9546512Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI
2025-05-07T20:23:12.9547401Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_9737db9c-fa2f-4aa5-8f0b-5d1fd405ba6f
2025-05-07T20:23:12.9548083Z MACHINE_NAME=x86_64
2025-05-07T20:23:12.9548307Z _=/usr/bin/printenv
2025-05-07T20:23:12.9548452Z
2025-05-07T20:23:12.9548577Z ################################################################################
2025-05-07T20:23:12.9548897Z [INFO] Print ldd version ...
2025-05-07T20:23:12.9549166Z + ldd --version
2025-05-07T20:23:12.9549304Z
2025-05-07T20:23:12.9549403Z ldd (GNU libc) 2.34
2025-05-07T20:23:12.9549681Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:23:12.9550129Z This is free software; see the source for copying conditions.  There is NO
2025-05-07T20:23:12.9550670Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:23:12.9551132Z Written by Roland McGrath and Ulrich Drepper.
2025-05-07T20:23:12.9551357Z
2025-05-07T20:23:12.9551482Z ################################################################################
2025-05-07T20:23:12.9551790Z [INFO] Print CPU info ...
2025-05-07T20:23:12.9552037Z + nproc 2025-05-07T20:23:12.9552154Z 2025-05-07T20:23:12.9569674Z 16 2025-05-07T20:23:12.9571771Z 2025-05-07T20:23:12.9572059Z + lscpu 2025-05-07T20:23:12.9572244Z 2025-05-07T20:23:12.9685484Z Architecture: x86_64 2025-05-07T20:23:12.9686489Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:12.9690698Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.9691486Z Byte Order: Little Endian 2025-05-07T20:23:12.9692121Z CPU(s): 16 2025-05-07T20:23:12.9692710Z On-line CPU(s) list: 0-15 2025-05-07T20:23:12.9693351Z Vendor ID: AuthenticAMD 2025-05-07T20:23:12.9694035Z Model name: AMD EPYC 7R32 2025-05-07T20:23:12.9694484Z CPU family: 23 2025-05-07T20:23:12.9694961Z Model: 49 2025-05-07T20:23:12.9695260Z Thread(s) per core: 2 2025-05-07T20:23:12.9695553Z Core(s) per socket: 8 2025-05-07T20:23:12.9695850Z Socket(s): 1 2025-05-07T20:23:12.9696133Z Stepping: 0 2025-05-07T20:23:12.9696436Z BogoMIPS: 5599.99 2025-05-07T20:23:12.9698646Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.9700763Z Hypervisor vendor: KVM 2025-05-07T20:23:12.9701076Z Virtualization type: full 2025-05-07T20:23:12.9701592Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:12.9701962Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:12.9702354Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:12.9702844Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:12.9703305Z NUMA node(s): 1 2025-05-07T20:23:12.9703732Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:12.9704207Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:12.9704694Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:12.9705081Z Vulnerability L1tf: Not affected 2025-05-07T20:23:12.9705438Z Vulnerability Mds: Not affected 2025-05-07T20:23:12.9705896Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:12.9706434Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:12.9706811Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:12.9707591Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:12.9708410Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:12.9709015Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:12.9709708Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:12.9710619Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:12.9711369Z Vulnerability Srbds: Not affected 2025-05-07T20:23:12.9711756Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:12.9712084Z 2025-05-07T20:23:12.9712177Z + cat /proc/cpuinfo 2025-05-07T20:23:12.9712317Z 2025-05-07T20:23:12.9712410Z processor : 0 2025-05-07T20:23:12.9712626Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.9712881Z cpu family : 23 2025-05-07T20:23:12.9713096Z model : 49 
2025-05-07T20:23:12.9713304Z model name : AMD EPYC 7R32
2025-05-07T20:23:12.9713556Z stepping : 0
2025-05-07T20:23:12.9713775Z microcode : 0x830107f
2025-05-07T20:23:12.9714126Z cpu MHz : 3305.720
2025-05-07T20:23:12.9714349Z cache size : 512 KB
2025-05-07T20:23:12.9714582Z physical id : 0
2025-05-07T20:23:12.9714791Z siblings : 16
2025-05-07T20:23:12.9714998Z core id : 0
2025-05-07T20:23:12.9715202Z cpu cores : 8
2025-05-07T20:23:12.9715403Z apicid : 0
2025-05-07T20:23:12.9715611Z initial apicid : 0
2025-05-07T20:23:12.9715830Z fpu : yes
2025-05-07T20:23:12.9716027Z fpu_exception : yes
2025-05-07T20:23:12.9716252Z cpuid level : 13
2025-05-07T20:23:12.9716463Z wp : yes
2025-05-07T20:23:12.9718591Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
2025-05-07T20:23:12.9720887Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret
2025-05-07T20:23:12.9721378Z bogomips : 5599.99
2025-05-07T20:23:12.9721603Z TLB size : 3072 4K pages
2025-05-07T20:23:12.9721843Z clflush size : 64
2025-05-07T20:23:12.9722061Z cache_alignment : 64
2025-05-07T20:23:12.9722333Z address sizes : 48 bits physical, 48 bits virtual
2025-05-07T20:23:12.9722657Z power management:
2025-05-07T20:23:12.9722794Z

[entries for processor 1 through processor 15 are identical to processor 0 apart from core id, apicid, initial apicid, and the sampled cpu MHz]

2025-05-07T20:23:12.9893298Z ################################################################################
2025-05-07T20:23:12.9893607Z [INFO] Print PCI info ...
2025-05-07T20:23:12.9893848Z + lspci -v
2025-05-07T20:23:12.9893969Z
2025-05-07T20:23:12.9894183Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:12.9894617Z Subsystem: Amazon.com, Inc.
Device 1237 2025-05-07T20:23:12.9894952Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:12.9895164Z 2025-05-07T20:23:12.9895368Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:12.9895758Z Physical Slot: 1 2025-05-07T20:23:12.9896007Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.9896212Z 2025-05-07T20:23:12.9896468Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:12.9896899Z Physical Slot: 1 2025-05-07T20:23:12.9897163Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:12.9897390Z 2025-05-07T20:23:12.9897664Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:12.9898202Z Physical Slot: 3 2025-05-07T20:23:12.9898451Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.9898798Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:12.9899160Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:12.9899385Z 2025-05-07T20:23:12.9899691Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:12.9900297Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:12.9900585Z Physical Slot: 4 2025-05-07T20:23:12.9900840Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:12.9901224Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.9901583Z Capabilities: 2025-05-07T20:23:12.9901844Z Kernel driver in use: nvme 2025-05-07T20:23:12.9902011Z 2025-05-07T20:23:12.9902503Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:12.9902988Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:12.9903334Z Physical Slot: 5 2025-05-07T20:23:12.9903577Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.9903935Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.9904334Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:12.9912360Z Capabilities: 2025-05-07T20:23:12.9912655Z Kernel driver in use: ena 2025-05-07T20:23:12.9912896Z Kernel modules: ena 2025-05-07T20:23:12.9913045Z 2025-05-07T20:23:12.9913219Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:12.9913606Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:12.9913906Z Physical Slot: 30 2025-05-07T20:23:12.9914295Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:12.9914750Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:12.9915154Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:12.9915607Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:12.9915947Z Capabilities: 2025-05-07T20:23:12.9916217Z Kernel driver in use: nvidia 2025-05-07T20:23:12.9916469Z Kernel modules: nvidia 2025-05-07T20:23:12.9916626Z 2025-05-07T20:23:12.9916935Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:12.9917462Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:12.9917756Z Physical Slot: 31 2025-05-07T20:23:12.9917996Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.9918359Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.9918755Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:12.9919081Z Capabilities: 2025-05-07T20:23:12.9919349Z Kernel driver in use: nvme 2025-05-07T20:23:12.9919514Z 2025-05-07T20:23:12.9919518Z 2025-05-07T20:23:12.9919643Z ################################################################################ 2025-05-07T20:23:12.9919973Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:12.9920257Z + uname -a 2025-05-07T20:23:12.9920377Z 2025-05-07T20:23:12.9920783Z Linux ip-10-0-69-200.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:12.9921280Z 2025-05-07T20:23:12.9921361Z + uname -m 2025-05-07T20:23:12.9921478Z 2025-05-07T20:23:12.9921550Z x86_64 2025-05-07T20:23:12.9921659Z 2025-05-07T20:23:12.9921741Z + cat /proc/version 2025-05-07T20:23:12.9921881Z 2025-05-07T20:23:12.9922423Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:12.9923052Z 2025-05-07T20:23:12.9923151Z + cat /etc/os-release 2025-05-07T20:23:12.9923295Z 2025-05-07T20:23:12.9923384Z NAME="Amazon Linux" 2025-05-07T20:23:12.9923602Z VERSION="2023" 2025-05-07T20:23:12.9923800Z ID="amzn" 2025-05-07T20:23:12.9923984Z ID_LIKE="fedora" 2025-05-07T20:23:12.9924194Z VERSION_ID="2023" 2025-05-07T20:23:12.9924435Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:12.9924775Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:12.9925056Z ANSI_COLOR="0;33" 2025-05-07T20:23:12.9925304Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:12.9925832Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:12.9926266Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:12.9926682Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:12.9927123Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:12.9927490Z VENDOR_NAME="AWS" 2025-05-07T20:23:12.9927734Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:12.9928020Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:12.9928175Z 2025-05-07T20:23:12.9928414Z ################################################################################ 2025-05-07T20:23:12.9928716Z # Print EC2 Instance Info 2025-05-07T20:23:12.9928956Z # 2025-05-07T20:23:12.9929177Z # [2025-05-07T20:23:12.990Z] + print_ec2_info 2025-05-07T20:23:12.9929488Z ################################################################################ 2025-05-07T20:23:12.9929705Z 2025-05-07T20:23:13.0028558Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:13.0146803Z instance-id: i-06f3d8044a6f79407 2025-05-07T20:23:13.0254352Z instance-type: g5.4xlarge 2025-05-07T20:23:13.0294069Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:13.0294431Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:13.0303775Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:13.0304134Z env: 2025-05-07T20:23:13.0304357Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:13.0304673Z BUILD_ENV: build_binary 2025-05-07T20:23:13.0304926Z BUILD_TARGET: genai 2025-05-07T20:23:13.0305155Z BUILD_VARIANT: cuda 2025-05-07T20:23:13.0305444Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:13.0305710Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:13.0306014Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:13.0306352Z ##[endgroup] 2025-05-07T20:23:13.3657103Z ################################################################################ 2025-05-07T20:23:13.3657484Z [INFO] Printing general display info ... 2025-05-07T20:23:13.3688660Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:13.4789268Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:13.4798043Z /usr/bin/sudo 2025-05-07T20:23:13.4808680Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:13.4819368Z /usr/bin/yum 2025-05-07T20:23:13.4821050Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:13.4840894Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:13.9006286Z Last metadata expiration check: 0:00:08 ago on Wed May 7 20:23:05 2025. 2025-05-07T20:23:13.9740443Z ================================================================================ 2025-05-07T20:23:13.9741210Z WARNING: 2025-05-07T20:23:13.9741742Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:13.9742273Z 2025-05-07T20:23:13.9742454Z Available Versions: 2025-05-07T20:23:13.9742760Z 2025-05-07T20:23:13.9742933Z Version 2023.7.20250331: 2025-05-07T20:23:13.9743553Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:13.9744081Z 2025-05-07T20:23:13.9744345Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:13.9744616Z 2025-05-07T20:23:13.9744701Z Release notes: 2025-05-07T20:23:13.9745111Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:13.9745483Z 2025-05-07T20:23:13.9745582Z Version 2023.7.20250414: 2025-05-07T20:23:13.9745890Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:13.9746146Z 2025-05-07T20:23:13.9746263Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:13.9746479Z 2025-05-07T20:23:13.9746564Z Release notes: 2025-05-07T20:23:13.9746964Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:13.9747327Z 2025-05-07T20:23:13.9747427Z Version 2023.7.20250428: 2025-05-07T20:23:13.9747739Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:13.9748213Z 2025-05-07T20:23:13.9748330Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:13.9748543Z 2025-05-07T20:23:13.9748634Z Release notes: 2025-05-07T20:23:13.9749024Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:13.9749393Z 2025-05-07T20:23:13.9749502Z ================================================================================ 2025-05-07T20:23:14.0895948Z Dependencies resolved. 
2025-05-07T20:23:14.1184684Z ================================================================================
2025-05-07T20:23:14.1185094Z Package Arch Version Repository Size
2025-05-07T20:23:14.1185495Z ================================================================================
2025-05-07T20:23:14.1185804Z Upgrading:
2025-05-07T20:23:14.1186172Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M
2025-05-07T20:23:14.1186761Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M
2025-05-07T20:23:14.1187133Z
2025-05-07T20:23:14.1187435Z Transaction Summary
2025-05-07T20:23:14.1187694Z ================================================================================
2025-05-07T20:23:14.1187996Z Upgrade 2 Packages
2025-05-07T20:23:14.1188144Z
2025-05-07T20:23:14.1188276Z Total download size: 6.9 M
2025-05-07T20:23:14.1189917Z Downloading Packages:
2025-05-07T20:23:14.1605779Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 31 MB/s | 1.2 MB 00:00
2025-05-07T20:23:14.3385371Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 26 MB/s | 5.7 MB 00:00
2025-05-07T20:23:14.3395266Z --------------------------------------------------------------------------------
2025-05-07T20:23:14.3396394Z Total 31 MB/s | 6.9 MB 00:00
2025-05-07T20:23:14.3398859Z Running transaction check
2025-05-07T20:23:14.3493974Z Transaction check succeeded.
2025-05-07T20:23:14.3494285Z Running transaction test
2025-05-07T20:23:14.3788176Z Transaction test succeeded.
2025-05-07T20:23:14.3790920Z Running transaction
2025-05-07T20:23:14.9307673Z Preparing : 1/1
2025-05-07T20:23:15.0363690Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4
2025-05-07T20:23:15.0390226Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4
2025-05-07T20:23:15.0587005Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4
2025-05-07T20:23:15.0587688Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4
2025-05-07T20:23:15.0698618Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4
2025-05-07T20:23:15.0723482Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4
2025-05-07T20:23:15.2174719Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4
2025-05-07T20:23:15.2175318Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4
2025-05-07T20:23:15.2175954Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4
2025-05-07T20:23:15.2176491Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4
2025-05-07T20:23:15.4193169Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4
2025-05-07T20:23:15.4193585Z
2025-05-07T20:23:15.4193671Z Upgraded:
2025-05-07T20:23:15.4194057Z nvidia-container-toolkit-1.17.6-1.x86_64
2025-05-07T20:23:15.4194725Z nvidia-container-toolkit-base-1.17.6-1.x86_64
2025-05-07T20:23:15.4195127Z
2025-05-07T20:23:15.4195209Z Complete!
2025-05-07T20:23:15.4668737Z [INSTALL] Installing system package(s): hostname lshw ...
2025-05-07T20:23:15.4691703Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw
2025-05-07T20:23:15.8951794Z Last metadata expiration check: 0:00:10 ago on Wed May 7 20:23:05 2025.
2025-05-07T20:23:15.9190453Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed.
2025-05-07T20:23:15.9597901Z Dependencies resolved.
2025-05-07T20:23:15.9775832Z ================================================================================
2025-05-07T20:23:15.9776291Z Package Architecture Version Repository Size
2025-05-07T20:23:15.9776715Z ================================================================================
2025-05-07T20:23:15.9777006Z Installing:
2025-05-07T20:23:15.9777303Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k
2025-05-07T20:23:15.9777572Z
2025-05-07T20:23:15.9777670Z Transaction Summary
2025-05-07T20:23:15.9777915Z ================================================================================
2025-05-07T20:23:15.9778324Z Install 1 Package
2025-05-07T20:23:15.9778469Z
2025-05-07T20:23:15.9778596Z Total download size: 319 k
2025-05-07T20:23:15.9779444Z Installed size: 837 k
2025-05-07T20:23:15.9781180Z Downloading Packages:
2025-05-07T20:23:16.0501875Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 7.4 MB/s | 319 kB 00:00
2025-05-07T20:23:16.0507492Z --------------------------------------------------------------------------------
2025-05-07T20:23:16.0510476Z Total 4.3 MB/s | 319 kB 00:00
2025-05-07T20:23:16.0664964Z Running transaction check
2025-05-07T20:23:16.0719809Z Transaction check succeeded.
2025-05-07T20:23:16.0720752Z Running transaction test
2025-05-07T20:23:16.1184426Z Transaction test succeeded.
2025-05-07T20:23:16.1187698Z Running transaction
2025-05-07T20:23:16.2238166Z Preparing : 1/1
2025-05-07T20:23:16.2769460Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1
2025-05-07T20:23:16.4931286Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1
2025-05-07T20:23:16.6490885Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1
2025-05-07T20:23:16.6491226Z
2025-05-07T20:23:16.6491315Z Installed:
2025-05-07T20:23:16.6491635Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64
2025-05-07T20:23:16.6491931Z
2025-05-07T20:23:16.6492022Z Complete!
2025-05-07T20:23:16.6940252Z + hostname
2025-05-07T20:23:16.6940421Z
2025-05-07T20:23:16.6954402Z ip-10-0-69-200.ec2.internal
2025-05-07T20:23:16.6956341Z
2025-05-07T20:23:16.6956925Z + sudo lshw -C display
2025-05-07T20:23:16.6957095Z
2025-05-07T20:23:17.1156919Z *-display:0 UNCLAIMED
2025-05-07T20:23:17.1157408Z description: VGA compatible controller
2025-05-07T20:23:17.1157903Z product: Amazon.com, Inc.
2025-05-07T20:23:17.1158335Z vendor: Amazon.com, Inc.
2025-05-07T20:23:17.1158742Z physical id: 3 2025-05-07T20:23:17.1159110Z bus info: pci@0000:00:03.0 2025-05-07T20:23:17.1159518Z version: 00 2025-05-07T20:23:17.1159849Z width: 32 bits 2025-05-07T20:23:17.1160189Z clock: 33MHz 2025-05-07T20:23:17.1160582Z capabilities: vga_controller bus_master 2025-05-07T20:23:17.1161084Z configuration: latency=0 2025-05-07T20:23:17.1161597Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:17.1162140Z *-display:1 2025-05-07T20:23:17.1162521Z description: 3D controller 2025-05-07T20:23:17.1162974Z product: GA102GL [A10G] 2025-05-07T20:23:17.1163393Z vendor: NVIDIA Corporation 2025-05-07T20:23:17.1163792Z physical id: 1e 2025-05-07T20:23:17.1164127Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:17.1164481Z version: a1 2025-05-07T20:23:17.1164787Z width: 64 bits 2025-05-07T20:23:17.1165105Z clock: 33MHz 2025-05-07T20:23:17.1165526Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:17.1166120Z configuration: driver=nvidia latency=0 2025-05-07T20:23:17.1167076Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:17.1200236Z 2025-05-07T20:23:17.1200658Z ################################################################################ 2025-05-07T20:23:17.1200994Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:17.1329692Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:17.1498963Z Wed May 7 20:23:17 2025 2025-05-07T20:23:17.1499487Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:17.1500013Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:17.1500501Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:17.1500989Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:17.1501508Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:17.1501935Z | | | MIG M. | 2025-05-07T20:23:17.1502263Z |=========================================+========================+======================| 2025-05-07T20:23:17.1583548Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:17.1584231Z | 0% 30C P0 60W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:17.1584613Z | | | N/A | 2025-05-07T20:23:17.1584998Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:17.1585390Z 2025-05-07T20:23:17.1585768Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:17.1586192Z | Processes: | 2025-05-07T20:23:17.1586628Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:17.1587039Z | ID ID Usage | 2025-05-07T20:23:17.1587384Z |=========================================================================================| 2025-05-07T20:23:17.1588487Z | No running processes found | 2025-05-07T20:23:17.1588951Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:17.2973066Z ################################################################################ 2025-05-07T20:23:17.2973446Z [INFO] Printing AMD GPU info ... 
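Aside: the GPU_FLAG value exported in each step's env block above ("--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all") is presumably spliced into later docker run invocations by the CI scripts. A sketch of how such a flag is typically consumed (the container image tag is illustrative, not taken from this log):

  GPU_FLAG="--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all"
  # Left unquoted on purpose so it word-splits into separate docker arguments.
  docker run --rm ${GPU_FLAG} nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi
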
2025-05-07T20:23:17.3113504Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:17.3114394Z [CHECK] rocminfo not found 2025-05-07T20:23:17.3123395Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:17.3124493Z [CHECK] rocm-smi not found 2025-05-07T20:23:17.3187512Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:17.3187953Z . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:17.3200734Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:17.3201093Z env: 2025-05-07T20:23:17.3201333Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:17.3201647Z BUILD_ENV: build_binary 2025-05-07T20:23:17.3201901Z BUILD_TARGET: genai 2025-05-07T20:23:17.3202137Z BUILD_VARIANT: cuda 2025-05-07T20:23:17.3202382Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:17.3202648Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:17.3202951Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:17.3203292Z ##[endgroup] 2025-05-07T20:23:17.6545927Z ################################################################################ 2025-05-07T20:23:17.6546287Z # Setup Miniconda 2025-05-07T20:23:17.6546507Z # 2025-05-07T20:23:17.6561209Z # [2025-05-07T20:23:17.655Z] + setup_miniconda /home/ec2-user/miniconda 2025-05-07T20:23:17.6561624Z ################################################################################ 2025-05-07T20:23:17.6561841Z 2025-05-07T20:23:17.6577336Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:17.7461464Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:17.7461828Z + mkdir -p /home/ec2-user/miniconda 2025-05-07T20:23:17.7462032Z 2025-05-07T20:23:17.7478265Z 2025-05-07T20:23:17.7478581Z [SETUP] Downloading the Miniconda installer ... 2025-05-07T20:23:17.7499527Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh 2025-05-07T20:23:18.5997697Z [SETUP] Installing Miniconda ... 2025-05-07T20:23:18.5998125Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u 2025-05-07T20:23:18.5998390Z 2025-05-07T20:23:18.6142319Z PREFIX=/home/ec2-user/miniconda 2025-05-07T20:23:19.0664661Z Unpacking payload ... 2025-05-07T20:23:19.5850654Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:20.3832730Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:22.4856892Z 2025-05-07T20:23:22.4857593Z Installing base environment... 2025-05-07T20:23:22.4857905Z 2025-05-07T20:23:23.5659360Z Preparing transaction: ...working... done 2025-05-07T20:23:26.5532420Z Executing transaction: ...working... done 2025-05-07T20:23:27.2100444Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:27.2984074Z installation finished. 2025-05-07T20:23:27.2992504Z 2025-05-07T20:23:27.2992729Z + rm -f miniconda.sh 2025-05-07T20:23:27.2992915Z 2025-05-07T20:23:27.3294869Z 2025-05-07T20:23:27.3295237Z [SETUP] Reloading the bash configuration ... 
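Editor's note: the Miniconda bootstrap just logged is fully scripted; a condensed replay of the same sequence (download, batch install, shell init), using the exact commands from the log with the prefix as a placeholder, before the conda init output below:

    # Non-interactive Miniconda install, as performed by setup_miniconda.
    PREFIX="$HOME/miniconda"
    wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
    bash miniconda.sh -b -p "$PREFIX" -u   # -b: batch mode (no prompts), -u: allow updating an existing prefix
    rm -f miniconda.sh
    "$PREFIX/bin/conda" init bash          # writes activation hooks into ~/.bashrc
    . ~/.bashrc                            # reload so `conda` resolves in the current shell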
2025-05-07T20:23:27.3295732Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:23:27.6946195Z no change /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:23:27.6946746Z no change /home/ec2-user/miniconda/bin/conda
2025-05-07T20:23:27.6947255Z no change /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:23:27.6947628Z no change /home/ec2-user/miniconda/bin/activate
2025-05-07T20:23:27.6947997Z no change /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:23:27.6948392Z no change /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:23:27.6948828Z no change /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:23:27.6949274Z no change /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:23:27.6949736Z no change /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:23:27.6950516Z no change /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:23:27.6951047Z no change /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:23:27.6951426Z modified /home/ec2-user/.bashrc
2025-05-07T20:23:27.6951817Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:23:27.7598067Z + . /home/ec2-user/.bashrc
2025-05-07T20:23:28.5881944Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:23:28.5906895Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:23:41.9849225Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:23:43.5317991Z Solving environment: done
2025-05-07T20:23:43.6288826Z ## Package Plan ##
2025-05-07T20:23:43.6290921Z environment location: /home/ec2-user/miniconda
2025-05-07T20:23:43.6291291Z added / updated specs:
2025-05-07T20:23:43.6291561Z - conda-libmamba-solver
2025-05-07T20:23:43.6291821Z - libarchive
2025-05-07T20:23:43.6292038Z - libmamba
2025-05-07T20:23:43.6292289Z - libmambapy
2025-05-07T20:23:43.6292565Z The following packages will be downloaded:
2025-05-07T20:23:43.6292898Z package | build
2025-05-07T20:23:43.6293221Z ---------------------------|-----------------
2025-05-07T20:23:43.6293640Z ca-certificates-2025.4.26 | hbd8a1cb_0 149 KB conda-forge
2025-05-07T20:23:43.6294117Z certifi-2025.4.26 | pyhd8ed1ab_0 154 KB conda-forge
2025-05-07T20:23:43.6294542Z conda-25.3.1 | py313h78bf25f_1 1.1 MB conda-forge
2025-05-07T20:23:43.6295023Z conda-libmamba-solver-25.4.0| pyhd8ed1ab_0 41 KB conda-forge
2025-05-07T20:23:43.6295473Z ------------------------------------------------------------
2025-05-07T20:23:43.6295809Z Total: 1.4 MB
2025-05-07T20:23:43.6296139Z The following packages will be UPDATED:
2025-05-07T20:23:43.6300600Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:23:43.6301395Z conda pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:23:43.6302003Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:23:43.6302700Z certifi pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:23:43.6303510Z conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:23:43.6304155Z Downloading and Extracting Packages: ...working... done (all four downloads reached 100%; per-package progress bars elided)
2025-05-07T20:23:43.9459730Z Preparing transaction: done
2025-05-07T20:23:44.0464850Z Verifying transaction: done
2025-05-07T20:23:45.3483436Z Executing transaction: done
2025-05-07T20:23:47.0518528Z [SETUP] Updating Miniconda base packages ...
2025-05-07T20:23:47.0542901Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:23:47.9829647Z Channels:
2025-05-07T20:23:47.9829984Z - defaults
2025-05-07T20:23:47.9830288Z Platform: linux-64
2025-05-07T20:23:49.1893077Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:49.3061892Z Solving environment: done
2025-05-07T20:23:49.3062354Z Channels:
2025-05-07T20:23:49.3062663Z - defaults
2025-05-07T20:23:49.3062663Z Platform: linux-64
2025-05-07T20:23:49.5995428Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:49.8151296Z Solving environment: done
2025-05-07T20:23:49.9640147Z ## Package Plan ##
2025-05-07T20:23:49.9640772Z environment location: /home/ec2-user/miniconda
2025-05-07T20:23:49.9641456Z added / updated specs:
2025-05-07T20:23:49.9641947Z - conda
2025-05-07T20:23:49.9642452Z The following packages will be downloaded:
2025-05-07T20:23:49.9643135Z package | build
2025-05-07T20:23:49.9643730Z ---------------------------|-----------------
2025-05-07T20:23:49.9644132Z pip-25.1 | pyhc872135_2 1.3 MB
2025-05-07T20:23:49.9644808Z tzdata-2025b | h04d1e81_0 116 KB
2025-05-07T20:23:49.9645197Z ------------------------------------------------------------
2025-05-07T20:23:49.9645539Z Total: 1.4 MB
2025-05-07T20:23:49.9645871Z The following packages will be UPDATED:
2025-05-07T20:23:49.9646525Z pip pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:23:49.9647051Z tzdata 2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:23:49.9647454Z Downloading and Extracting Packages: ...working... done (both downloads reached 100%; progress bars elided)
2025-05-07T20:23:50.3248732Z Preparing transaction: done
2025-05-07T20:23:50.4255020Z Verifying transaction: done
2025-05-07T20:23:52.4334244Z Executing transaction: done
2025-05-07T20:23:53.0639557Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:23:53.0643496Z + conda clean --packages --tarball -y
2025-05-07T20:23:54.0684327Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:23:54.0684677Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:23:54.1328465Z + conda clean --all -y
2025-05-07T20:23:54.6709792Z There are no unused tarball(s) to remove.
2025-05-07T20:23:54.6710156Z Will remove 1 index cache(s).
2025-05-07T20:23:54.6710451Z There are no unused package(s) to remove.
2025-05-07T20:23:54.6710769Z There are no tempfile(s) to remove. 2025-05-07T20:23:54.6711061Z There are no logfile(s) to remove. 2025-05-07T20:23:54.7327498Z 2025-05-07T20:23:54.7332780Z + conda info 2025-05-07T20:23:54.7332950Z 2025-05-07T20:23:55.5017812Z 2025-05-07T20:23:55.5018565Z active environment : base 2025-05-07T20:23:55.5019082Z active env location : /home/ec2-user/miniconda 2025-05-07T20:23:55.5019545Z shell level : 1 2025-05-07T20:23:55.5019837Z user config file : /home/ec2-user/.condarc 2025-05-07T20:23:55.5020236Z populated config files : /home/ec2-user/miniconda/.condarc 2025-05-07T20:23:55.5020616Z conda version : 25.3.1 2025-05-07T20:23:55.5020892Z conda-build version : not installed 2025-05-07T20:23:55.5021196Z python version : 3.13.2.final.0 2025-05-07T20:23:55.5021497Z solver : libmamba (default) 2025-05-07T20:23:55.5021801Z virtual packages : __archspec=1=zen2 2025-05-07T20:23:55.5022100Z __conda=25.3.1=0 2025-05-07T20:23:55.5022377Z __cuda=12.8=0 2025-05-07T20:23:55.5022652Z __glibc=2.34=0 2025-05-07T20:23:55.5022923Z __linux=6.1.130=0 2025-05-07T20:23:55.5023197Z __unix=0=0 2025-05-07T20:23:55.5023876Z base environment : /home/ec2-user/miniconda (writable) 2025-05-07T20:23:55.5024293Z conda av data dir : /home/ec2-user/miniconda/etc/conda 2025-05-07T20:23:55.5024642Z conda av metadata url : None 2025-05-07T20:23:55.5025012Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64 2025-05-07T20:23:55.5025440Z https://repo.anaconda.com/pkgs/main/noarch 2025-05-07T20:23:55.5025828Z https://repo.anaconda.com/pkgs/r/linux-64 2025-05-07T20:23:55.5026203Z https://repo.anaconda.com/pkgs/r/noarch 2025-05-07T20:23:55.5026581Z package cache : /home/ec2-user/miniconda/pkgs 2025-05-07T20:23:55.5026916Z /home/ec2-user/.conda/pkgs 2025-05-07T20:23:55.5027256Z envs directories : /home/ec2-user/miniconda/envs 2025-05-07T20:23:55.5027593Z /home/ec2-user/.conda/envs 2025-05-07T20:23:55.5027889Z platform : linux-64 2025-05-07T20:23:55.5028731Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/. 2025-05-07T20:23:55.5029701Z UID:GID : 1000:1000 2025-05-07T20:23:55.5029972Z netrc file : None 2025-05-07T20:23:55.5030227Z offline mode : False 2025-05-07T20:23:55.5030401Z 2025-05-07T20:23:55.5686879Z 2025-05-07T20:23:55.5687303Z [SETUP] Exporting Miniconda variables ... 2025-05-07T20:23:55.5688026Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_f43e8320-63a0-46a8-acf5-3813a231fef4 ... 2025-05-07T20:23:55.5690271Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda 2025-05-07T20:23:55.5860285Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.10 2025-05-07T20:23:55.5860788Z . 
$PRELUDE; create_conda_environment $BUILD_ENV 3.10 2025-05-07T20:23:55.5878557Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:55.5878912Z env: 2025-05-07T20:23:55.5879136Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:55.5879444Z BUILD_ENV: build_binary 2025-05-07T20:23:55.5879707Z BUILD_TARGET: genai 2025-05-07T20:23:55.5879938Z BUILD_VARIANT: cuda 2025-05-07T20:23:55.5880169Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:55.5880430Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:55.5880737Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:55.5881065Z ##[endgroup] 2025-05-07T20:23:55.9259010Z ################################################################################ 2025-05-07T20:23:55.9259398Z # Create Conda Environment 2025-05-07T20:23:55.9259648Z # 2025-05-07T20:23:55.9275626Z # [2025-05-07T20:23:55.927Z] + create_conda_environment build_binary 3.10 2025-05-07T20:23:55.9276053Z ################################################################################ 2025-05-07T20:23:55.9284082Z 2025-05-07T20:23:55.9290759Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:56.0221349Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:56.0221739Z [SETUP] Listing existing Conda environments ... 2025-05-07T20:23:56.0222085Z + conda info --envs 2025-05-07T20:23:56.0222227Z 2025-05-07T20:23:56.7673748Z 2025-05-07T20:23:56.7674235Z # conda environments: 2025-05-07T20:23:56.7674534Z # 2025-05-07T20:23:56.7674767Z base /home/ec2-user/miniconda 2025-05-07T20:23:56.7674996Z 2025-05-07T20:23:56.8327488Z 2025-05-07T20:23:56.8328129Z [SETUP] Deleting the prefix directory if it exists ... 2025-05-07T20:23:58.4608146Z + rm -rf /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:23:58.4608434Z 2025-05-07T20:23:58.4624511Z 2025-05-07T20:23:58.4633736Z [SETUP] Creating new Conda environment (Python 3.10) ... 
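Editor's note: every network-bound command in this log is echoed as [EXEC] [ATTEMPT 0/3] + ..., including the conda create call below, which indicates that setup_env.bash routes such commands through a retry helper. A minimal sketch of such a wrapper, assuming three attempts and an illustrative fixed delay (the real implementation in .github/scripts/setup_env.bash may differ):

    # Hypothetical retry wrapper matching the [EXEC] [ATTEMPT i/3] log format.
    exec_with_retries () {
      local max_retries=3 attempt=0
      while (( attempt < max_retries )); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_retries}] + $*"
        if "$@"; then return 0; fi
        attempt=$((attempt + 1))
        sleep 10   # illustrative delay; the actual backoff is not visible in the log
      done
      echo "[EXEC] command failed after ${max_retries} attempts: $*" >&2
      return 1
    }
    exec_with_retries conda create -y -n build_binary python=3.10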
2025-05-07T20:23:58.4656888Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.10
2025-05-07T20:23:59.2270562Z Channels:
2025-05-07T20:23:59.2270812Z - defaults
2025-05-07T20:23:59.2271034Z Platform: linux-64
2025-05-07T20:24:00.7558276Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:00.8564017Z Solving environment: done
2025-05-07T20:24:00.8906371Z ## Package Plan ##
2025-05-07T20:24:00.8906810Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:00.8907240Z added / updated specs:
2025-05-07T20:24:00.8907495Z - python=3.10
2025-05-07T20:24:00.8907758Z The following packages will be downloaded:
2025-05-07T20:24:00.8908130Z package | build
2025-05-07T20:24:00.8908458Z ---------------------------|-----------------
2025-05-07T20:24:00.8908814Z _libgcc_mutex-0.1 | main 3 KB
2025-05-07T20:24:00.8909223Z _openmp_mutex-5.1 | 1_gnu 21 KB
2025-05-07T20:24:00.8909651Z ca-certificates-2025.2.25 | h06a4308_0 129 KB
2025-05-07T20:24:00.8910069Z python-3.10.16 | he870216_1 26.9 MB
2025-05-07T20:24:00.8910852Z setuptools-78.1.1 | py310h06a4308_0 1.7 MB
2025-05-07T20:24:00.8911255Z wheel-0.45.1 | py310h06a4308_0 115 KB
2025-05-07T20:24:00.8911624Z ------------------------------------------------------------
2025-05-07T20:24:00.8911957Z Total: 28.8 MB
2025-05-07T20:24:00.8912301Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:00.8912950Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:24:00.8913405Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:24:00.8913825Z bzip2 pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6
2025-05-07T20:24:00.8914311Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:24:00.8914859Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:24:00.8915333Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:24:00.8915760Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:24:00.8916209Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:24:00.8916686Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:24:00.8917191Z libuuid pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0
2025-05-07T20:24:00.8917613Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:24:00.8918038Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:24:00.8918450Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:00.8918849Z python pkgs/main/linux-64::python-3.10.16-he870216_1
2025-05-07T20:24:00.8919280Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:24:00.8919755Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py310h06a4308_0
2025-05-07T20:24:00.8920232Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:24:00.8920614Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:24:00.8920998Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:24:00.8921420Z wheel pkgs/main/linux-64::wheel-0.45.1-py310h06a4308_0
2025-05-07T20:24:00.8921815Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:24:00.8922187Z zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:24:00.8922589Z Downloading and Extracting Packages: ...working... done (all six downloads reached 100%; progress bars elided)
2025-05-07T20:24:02.1514109Z Preparing transaction: done
2025-05-07T20:24:03.3220196Z Verifying transaction: done
2025-05-07T20:24:05.5441099Z Executing transaction: done
2025-05-07T20:24:05.5945170Z #
2025-05-07T20:24:05.5945509Z # To activate this environment, use
2025-05-07T20:24:05.5945810Z #
2025-05-07T20:24:05.5946073Z # $ conda activate build_binary
2025-05-07T20:24:05.5946443Z #
2025-05-07T20:24:05.5946749Z # To deactivate an active environment, use
2025-05-07T20:24:05.5947535Z #
2025-05-07T20:24:05.5947842Z # $ conda deactivate
2025-05-07T20:24:05.6985478Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:05.7006847Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:08.6473259Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (25.1)
2025-05-07T20:24:08.6473870Z Collecting pip
2025-05-07T20:24:08.6474202Z Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:08.6475024Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:08.6475860Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 54.3 MB/s eta 0:00:00
2025-05-07T20:24:08.6476229Z Installing collected packages: pip
2025-05-07T20:24:08.6476536Z Attempting uninstall: pip
2025-05-07T20:24:08.6476832Z Found existing installation: pip 25.1
2025-05-07T20:24:08.6477145Z Uninstalling pip-25.1:
2025-05-07T20:24:08.6477435Z Successfully uninstalled pip-25.1
2025-05-07T20:24:08.6477770Z Successfully installed pip-25.1.1
2025-05-07T20:24:08.7106865Z [SETUP] Upgrading pyOpenSSL ...
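Editor's note: two invocation patterns recur in this phase. conda run -n <env> executes a tool inside the environment without activating it in the calling shell (used for the pip upgrade above), while conda install -c conda-forge --override-channels pins the solve to a single channel (used for the pyOpenSSL step below). A sketch of both, assuming the build_binary environment from the previous step:

    # Pattern 1: run a tool inside the env without activating it.
    conda run -n build_binary pip install --upgrade pip

    # Pattern 2: install from conda-forge only, ignoring the default channels.
    # Note the quotes: unquoted, the `>` in the version spec would be parsed
    # by the shell as an output redirection.
    conda install -n build_binary -c conda-forge --override-channels -y 'pyOpenSSL>22.1.0'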
2025-05-07T20:24:08.7129588Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:09.5653385Z Channels:
2025-05-07T20:24:09.5653647Z - conda-forge
2025-05-07T20:24:09.5653900Z Platform: linux-64
2025-05-07T20:24:20.2413627Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:21.8440601Z Solving environment: done
2025-05-07T20:24:21.9042461Z ## Package Plan ##
2025-05-07T20:24:21.9042932Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:21.9043330Z added / updated specs:
2025-05-07T20:24:21.9043609Z - pyopenssl[version='>22.1.0']
2025-05-07T20:24:21.9043963Z The following packages will be downloaded:
2025-05-07T20:24:21.9044299Z package | build
2025-05-07T20:24:21.9044614Z ---------------------------|-----------------
2025-05-07T20:24:21.9044981Z cffi-1.17.1 | py310h8deb56e_0 238 KB conda-forge
2025-05-07T20:24:21.9045428Z cryptography-44.0.3 | py310h6c63255_0 1.5 MB conda-forge
2025-05-07T20:24:21.9045878Z libgcc-15.1.0 | h767d61c_2 810 KB conda-forge
2025-05-07T20:24:21.9046301Z libgcc-ng-15.1.0 | h69a702a_2 34 KB conda-forge
2025-05-07T20:24:21.9046719Z libgomp-15.1.0 | h767d61c_2 442 KB conda-forge
2025-05-07T20:24:21.9047131Z openssl-3.5.0 | h7b32b05_1 3.0 MB conda-forge
2025-05-07T20:24:21.9047551Z pycparser-2.22 | pyh29332c3_1 108 KB conda-forge
2025-05-07T20:24:21.9047985Z pyopenssl-25.0.0 | pyhd8ed1ab_0 120 KB conda-forge
2025-05-07T20:24:21.9048415Z python_abi-3.10 | 2_cp310 4 KB conda-forge
2025-05-07T20:24:21.9048865Z typing-extensions-4.13.2 | h0e9735f_0 88 KB conda-forge
2025-05-07T20:24:21.9049359Z typing_extensions-4.13.2 | pyh29332c3_0 51 KB conda-forge
2025-05-07T20:24:21.9049786Z ------------------------------------------------------------
2025-05-07T20:24:21.9050125Z Total: 6.3 MB
2025-05-07T20:24:21.9050460Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:21.9050876Z cffi conda-forge/linux-64::cffi-1.17.1-py310h8deb56e_0
2025-05-07T20:24:21.9051373Z cryptography conda-forge/linux-64::cryptography-44.0.3-py310h6c63255_0
2025-05-07T20:24:21.9052263Z libgcc conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:24:21.9052712Z pycparser conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:24:21.9053191Z pyopenssl conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:24:21.9053656Z python_abi conda-forge/linux-64::python_abi-3.10-2_cp310
2025-05-07T20:24:21.9055977Z typing-extensions conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:24:21.9056885Z typing_extensions conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:24:21.9057352Z The following packages will be UPDATED:
2025-05-07T20:24:21.9057953Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:21.9058841Z libgcc-ng pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:24:21.9059483Z libgomp pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:24:21.9060103Z openssl pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:24:21.9060618Z Downloading and Extracting Packages: ...working... done (all eleven downloads reached 100%; progress bars elided)
2025-05-07T20:24:22.5186122Z Preparing transaction: done
2025-05-07T20:24:22.6188925Z Verifying transaction: done
2025-05-07T20:24:24.1213805Z Executing transaction: done
2025-05-07T20:24:24.2978531Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:24:26.0173640Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:24:26.0188128Z [SETUP] Installing libxcrypt ...
2025-05-07T20:24:26.0211691Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:24:26.8872281Z Channels:
2025-05-07T20:24:26.8872525Z - conda-forge
2025-05-07T20:24:26.8872758Z Platform: linux-64
2025-05-07T20:24:30.2654168Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:30.6314219Z Solving environment: done
2025-05-07T20:24:30.6924832Z ## Package Plan ##
2025-05-07T20:24:30.6925287Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:30.6925726Z added / updated specs:
2025-05-07T20:24:30.6925980Z - libxcrypt
2025-05-07T20:24:30.6926244Z The following packages will be downloaded:
2025-05-07T20:24:30.6926587Z package | build
2025-05-07T20:24:30.6926914Z ---------------------------|-----------------
2025-05-07T20:24:30.6927297Z libxcrypt-4.4.36 | hd590300_1 98 KB conda-forge
2025-05-07T20:24:30.6927707Z ------------------------------------------------------------
2025-05-07T20:24:30.6928048Z Total: 98 KB
2025-05-07T20:24:30.6928390Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:30.6928831Z libxcrypt conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:24:30.6929282Z Downloading and Extracting Packages: ...working... done (libxcrypt-4.4.36, 98 KB, 100%)
2025-05-07T20:24:30.9592038Z Preparing transaction: done
2025-05-07T20:24:31.0596685Z Verifying transaction: done
2025-05-07T20:24:31.1601349Z Executing transaction: done
2025-05-07T20:24:34.5960424Z [SETUP] Copying over ...
2025-05-07T20:24:34.5961378Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.10/crypt.h
2025-05-07T20:24:36.2398060Z [SETUP] Installed Python version: Python 3.10.16
2025-05-07T20:24:36.2399301Z [SETUP] Successfully created Conda environment: build_binary
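Editor's note: the libxcrypt step above ends by copying crypt.h into the environment's Python 3.10 include directory. The crypt.h header was split out of glibc into the separate libxcrypt project, and this copy presumably keeps extension builds that still pull in crypt.h through the Python headers compiling. A replay of that workaround, with the paths taken from this log:

    # Make crypt.h visible to anything compiling against this env's Python 3.10 headers.
    ENV_PREFIX="$HOME/miniconda/envs/build_binary"
    conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
    cp "$ENV_PREFIX/include/crypt.h" "$ENV_PREFIX/include/python3.10/crypt.h"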
$PRELUDE; install_cxx_compiler $BUILD_ENV gcc 2025-05-07T20:24:36.2449895Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:24:36.2450429Z env: 2025-05-07T20:24:36.2450668Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:24:36.2450981Z BUILD_ENV: build_binary 2025-05-07T20:24:36.2451229Z BUILD_TARGET: genai 2025-05-07T20:24:36.2451464Z BUILD_VARIANT: cuda 2025-05-07T20:24:36.2451704Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:24:36.2451961Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:24:36.2452270Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:24:36.2452605Z ##[endgroup] 2025-05-07T20:24:36.5814048Z ################################################################################ 2025-05-07T20:24:36.5814559Z # Install C/C++ Compilers 2025-05-07T20:24:36.5814907Z # 2025-05-07T20:24:36.5831334Z # [2025-05-07T20:24:36.582Z] + install_cxx_compiler build_binary gcc 2025-05-07T20:24:36.5831739Z ################################################################################ 2025-05-07T20:24:36.5831951Z 2025-05-07T20:24:36.5848797Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:24:36.6729659Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:24:36.6740510Z [INSTALL] Installing GLIBC (architecture = 64) ... 2025-05-07T20:24:36.6763712Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17 2025-05-07T20:24:37.5386749Z Channels: 2025-05-07T20:24:37.5387009Z - conda-forge 2025-05-07T20:24:37.5387241Z Platform: linux-64 2025-05-07T20:24:40.8251811Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:24:41.1931521Z Solving environment: \ done 2025-05-07T20:24:41.2548821Z 2025-05-07T20:24:41.2549035Z ## Package Plan ## 2025-05-07T20:24:41.2549260Z 2025-05-07T20:24:41.2549556Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:41.2549965Z 2025-05-07T20:24:41.2550091Z added / updated specs: 2025-05-07T20:24:41.2550434Z - sysroot_linux-64=2.17 2025-05-07T20:24:41.2550653Z 2025-05-07T20:24:41.2550658Z 2025-05-07T20:24:41.2550804Z The following packages will be downloaded: 2025-05-07T20:24:41.2551025Z 2025-05-07T20:24:41.2551146Z package | build 2025-05-07T20:24:41.2551477Z ---------------------------|----------------- 2025-05-07T20:24:41.2551904Z kernel-headers_linux-64-3.10.0| he073ed8_18 921 KB conda-forge 2025-05-07T20:24:41.2552404Z sysroot_linux-64-2.17 | h0157908_18 14.5 MB conda-forge 2025-05-07T20:24:41.2552965Z ------------------------------------------------------------ 2025-05-07T20:24:41.2553409Z Total: 15.4 MB 2025-05-07T20:24:41.2553626Z 2025-05-07T20:24:41.2553769Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:41.2554005Z 2025-05-07T20:24:41.2554292Z kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18 2025-05-07T20:24:41.2554874Z sysroot_linux-64 conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18 2025-05-07T20:24:41.2555194Z 2025-05-07T20:24:41.2555202Z 2025-05-07T20:24:41.2555212Z 2025-05-07T20:24:41.2555364Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:24:41.2556018Z sysroot_linux-64-2.1 | 14.5 MB | | 0% 2025-05-07T20:24:41.2556253Z 2025-05-07T20:24:41.4642668Z kernel-headers_linux | 921 KB | | 0%  2025-05-07T20:24:41.4650641Z sysroot_linux-64-2.1 | 14.5 MB | | 0% 2025-05-07T20:24:41.4650893Z 2025-05-07T20:24:41.4746672Z kernel-headers_linux | 921 KB | 1 | 2%  2025-05-07T20:24:41.4748275Z 2025-05-07T20:24:41.5643821Z kernel-headers_linux | 921 KB | ########## | 100%  2025-05-07T20:24:41.6324935Z sysroot_linux-64-2.1 | 14.5 MB | ########9 | 90% 2025-05-07T20:24:41.7271404Z sysroot_linux-64-2.1 | 14.5 MB | ########## | 100% 2025-05-07T20:24:41.7271668Z 2025-05-07T20:24:41.7272613Z kernel-headers_linux | 921 KB | ########## | 100%  2025-05-07T20:24:41.7272872Z 2025-05-07T20:24:42.2048173Z kernel-headers_linux | 921 KB | ########## | 100%  2025-05-07T20:24:42.2051337Z sysroot_linux-64-2.1 | 14.5 MB | ########## | 100% 2025-05-07T20:24:42.2052095Z 2025-05-07T20:24:42.2052527Z 2025-05-07T20:24:42.2052961Z  done 2025-05-07T20:24:42.3055368Z Preparing transaction: / done 2025-05-07T20:24:42.5066649Z Verifying transaction: \ | done 2025-05-07T20:24:42.7109631Z Executing transaction: - \ done 2025-05-07T20:24:42.8638043Z [CHECK] LD_LIBRARY_PATH = 2025-05-07T20:24:42.8638376Z [CHECK] CONDA_PREFIX is not set. 2025-05-07T20:24:44.5443291Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6 2025-05-07T20:24:44.5459186Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ... 2025-05-07T20:24:44.5481817Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0 2025-05-07T20:24:45.4450373Z Channels: 2025-05-07T20:24:45.4450718Z - conda-forge 2025-05-07T20:24:45.4451068Z Platform: linux-64 2025-05-07T20:24:48.6872572Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:24:49.6410113Z Solving environment: \ | / done 2025-05-07T20:24:49.7045632Z 2025-05-07T20:24:49.7046008Z ## Package Plan ## 2025-05-07T20:24:49.7046264Z 2025-05-07T20:24:49.7046525Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:49.7046852Z 2025-05-07T20:24:49.7046951Z added / updated specs: 2025-05-07T20:24:49.7047218Z - gxx_linux-64=11.4.0 2025-05-07T20:24:49.7047381Z 2025-05-07T20:24:49.7047384Z 2025-05-07T20:24:49.7047519Z The following packages will be downloaded: 2025-05-07T20:24:49.7047767Z 2025-05-07T20:24:49.7047886Z package | build 2025-05-07T20:24:49.7048211Z ---------------------------|----------------- 2025-05-07T20:24:49.7048628Z binutils_impl_linux-64-2.40| ha1999f0_7 6.0 MB conda-forge 2025-05-07T20:24:49.7049132Z binutils_linux-64-2.40 | hb3c18ed_4 28 KB conda-forge 2025-05-07T20:24:49.7049601Z gcc_impl_linux-64-11.4.0 | h00c12a0_13 53.0 MB conda-forge 2025-05-07T20:24:49.7050049Z gcc_linux-64-11.4.0 | ha077dfb_4 31 KB conda-forge 2025-05-07T20:24:49.7050498Z gxx_impl_linux-64-11.4.0 | h634f3ee_13 11.2 MB conda-forge 2025-05-07T20:24:49.7050935Z gxx_linux-64-11.4.0 | h35bfe5d_4 29 KB conda-forge 2025-05-07T20:24:49.7051372Z ld_impl_linux-64-2.40 | hf3520f5_7 691 KB conda-forge 2025-05-07T20:24:49.7051849Z libgcc-devel_linux-64-11.4.0| h8f596e0_113 2.3 MB conda-forge 2025-05-07T20:24:49.7052324Z libsanitizer-11.4.0 | h5763a12_13 3.5 MB conda-forge 2025-05-07T20:24:49.7052768Z libstdcxx-15.1.0 | h8f9b012_2 3.7 MB conda-forge 2025-05-07T20:24:49.7053241Z libstdcxx-devel_linux-64-11.4.0| h8f596e0_113 11.1 MB conda-forge 2025-05-07T20:24:49.7053728Z libstdcxx-ng-15.1.0 | h4852527_2 
34 KB conda-forge 2025-05-07T20:24:49.7054125Z ------------------------------------------------------------ 2025-05-07T20:24:49.7054466Z Total: 91.6 MB 2025-05-07T20:24:49.7054683Z 2025-05-07T20:24:49.7054814Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:49.7055040Z 2025-05-07T20:24:49.7055319Z binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7 2025-05-07T20:24:49.7056137Z binutils_linux-64 conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4 2025-05-07T20:24:49.7057116Z gcc_impl_linux-64 conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13 2025-05-07T20:24:49.7057649Z gcc_linux-64 conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4 2025-05-07T20:24:49.7058289Z gxx_impl_linux-64 conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13 2025-05-07T20:24:49.7058949Z gxx_linux-64 conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4 2025-05-07T20:24:49.7059525Z libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113 2025-05-07T20:24:49.7060101Z libsanitizer conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13 2025-05-07T20:24:49.7060603Z libstdcxx conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2 2025-05-07T20:24:49.7061144Z libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113 2025-05-07T20:24:49.7061512Z 2025-05-07T20:24:49.7061631Z The following packages will be UPDATED: 2025-05-07T20:24:49.7061838Z 2025-05-07T20:24:49.7062166Z ld_impl_linux-64 pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7 2025-05-07T20:24:49.7062888Z libstdcxx-ng pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2 2025-05-07T20:24:49.7063295Z 2025-05-07T20:24:49.7063305Z 2025-05-07T20:24:49.7063309Z 2025-05-07T20:24:49.7063455Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:24:49.7063455Z Downloading and Extracting Packages: ...working...
[progress-bar frames elided: gcc_impl_linux-64 (53.0 MB), gxx_impl_linux-64 (11.2 MB), libstdcxx-devel_linux-64 (11.1 MB), binutils_impl_linux-64 (6.0 MB), libstdcxx (3.7 MB), libsanitizer (3.5 MB), libgcc-devel_linux-64 (2.3 MB), ld_impl_linux-64 (691 KB), libstdcxx-ng (34 KB), gcc_linux-64 (31 KB), gxx_linux-64 (29 KB), binutils_linux-64 (28 KB) all downloaded to 100%]
2025-05-07T20:24:52.0868563Z done
2025-05-07T20:24:52.1878642Z Preparing transaction: done
2025-05-07T20:24:52.4889613Z Verifying transaction: done
2025-05-07T20:24:52.5905272Z Executing transaction: done
2025-05-07T20:24:52.7546388Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:24:56.6265952Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:56.6297935Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:24:56.6328433Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:24:56.6357952Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:24:58.5182244Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:58.5813910Z [CHECK] Binary cc found in PATH
2025-05-07T20:25:00.4598630Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:00.5232242Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:02.3970071Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:02.4588754Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:04.3356728Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:04.3974980Z [CHECK] Binary g++ found in PATH
2025-05-07T20:25:04.3979742Z [INFO] Printing out all preprocessor defines in the C compiler ...
2025-05-07T20:25:04.3980171Z + conda run -n build_binary cc -dM -E - 2025-05-07T20:25:04.3980380Z 2025-05-07T20:25:06.2822143Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:06.2822482Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:06.2822777Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:06.2823050Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:06.2823443Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:06.2826082Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:06.2826533Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:06.2826907Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:06.2827173Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:06.2827430Z #define __CHAR_BIT__ 8 2025-05-07T20:25:06.2827849Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:06.2828099Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:06.2828361Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:06.2828641Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:06.2828917Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:06.2829228Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.2829535Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:06.2829831Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:06.2830158Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:06.2830491Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:06.2830903Z #define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:06.2831323Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:06.2831644Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:06.2831932Z #define __GCC_IEC_559 2 2025-05-07T20:25:06.2832182Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:06.2832468Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:06.2832739Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:06.2833020Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:06.2833385Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.2833738Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:06.2834017Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:06.2834294Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:06.2834564Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:06.2834835Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:06.2835095Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:06.2835363Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:06.2835634Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:06.2835956Z #define __INT8_C(c) c 2025-05-07T20:25:06.2836203Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:06.2836508Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.2836829Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:06.2837154Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:06.2837519Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:06.2837803Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:06.2838070Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.2838356Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:06.2838644Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:06.2839038Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:06.2839461Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:06.2839756Z #define __linux 1 2025-05-07T20:25:06.2839992Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:06.2840280Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 
2025-05-07T20:25:06.2840569Z #define __unix 1 2025-05-07T20:25:06.2840797Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:06.2841083Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:06.2841356Z #define __WINT_MIN__ 0U 2025-05-07T20:25:06.2841601Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:06.2841893Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:06.2842171Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:06.2842441Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:06.2842691Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:06.2842986Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:06.2843315Z #define __INT64_C(c) c ## L 2025-05-07T20:25:06.2843606Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:06.2843910Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:06.2844182Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:06.2844534Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:06.2844910Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:06.2845264Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:06.2845526Z #define __DBL_DIG__ 15 2025-05-07T20:25:06.2845759Z #define __FLT32_DIG__ 6 2025-05-07T20:25:06.2846065Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:06.2846416Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:06.2846741Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:06.2847069Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:06.2847415Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:06.2847664Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:06.2847929Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:06.2848309Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:06.2848706Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:06.2848983Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:06.2849242Z #define __unix__ 1 2025-05-07T20:25:06.2849463Z #define __INT_WIDTH__ 32 2025-05-07T20:25:06.2849717Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:06.2849975Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:06.2850226Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:25:06.2850498Z #define __UINT16_C(c) c 2025-05-07T20:25:06.2850742Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:06.2851003Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:06.2851364Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:06.2851736Z #define __gnu_linux__ 1 2025-05-07T20:25:06.2851986Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:06.2852263Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:06.2852557Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.2852835Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:06.2853111Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:06.2853409Z #define __GNUC__ 11 2025-05-07T20:25:06.2853630Z #define __pie__ 2 2025-05-07T20:25:06.2862727Z #define __MMX__ 1 2025-05-07T20:25:06.2862968Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:06.2863239Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:06.2863573Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:06.2863850Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:06.2864201Z #define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:06.2864603Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.2864920Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:06.2865189Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:06.2865460Z #define 
__HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:06.2865749Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:06.2866004Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:06.2866257Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:06.2866539Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:06.2866829Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:06.2867102Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:06.2867383Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:06.2867630Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:06.2867893Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:06.2868162Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:06.2868418Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:06.2868674Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:06.2868993Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:06.2869349Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:06.2869622Z #define __SSE2_MATH__ 1 2025-05-07T20:25:06.2869867Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:06.2870164Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.2870459Z #define __amd64 1 2025-05-07T20:25:06.2870688Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:06.2870952Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:06.2871260Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:06.2871573Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:06.2871834Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:25:06.2872305Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:06.2872564Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:06.2873204Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:06.2873468Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:06.2873734Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:06.2873998Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:06.2874271Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:06.2874650Z #define __x86_64 1 2025-05-07T20:25:06.2874885Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:06.2875250Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:06.2875707Z #define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:06.2876156Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:06.2876621Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:06.2876998Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:06.2877249Z #define __LP64__ 1 2025-05-07T20:25:06.2877479Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.2877829Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:06.2878208Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:06.2878481Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:06.2878751Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:06.2879040Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:06.2879315Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:06.2879577Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:06.2879839Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:06.2880099Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:06.2880361Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:06.2880684Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:06.2881041Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:06.2881318Z #define __FLT_DIG__ 6 2025-05-07T20:25:06.2881543Z #define __NO_INLINE__ 1 2025-05-07T20:25:06.2881787Z #define 
__DEC_EVAL_METHOD__ 2 2025-05-07T20:25:06.2882113Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:06.2882456Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:06.2882716Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:06.2882980Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:06.2883254Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:06.2883542Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:06.2883804Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:06.2884095Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:06.2884379Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:06.2884645Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:06.2884950Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:06.2885273Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:06.2885537Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:06.2885794Z #define __FLT128_DIG__ 33 2025-05-07T20:25:06.2886026Z #define __INT32_C(c) c 2025-05-07T20:25:06.2886268Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:06.2886549Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:06.2886824Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:06.2887103Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:25:06.2887419Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:06.2887716Z #define unix 1 2025-05-07T20:25:06.2887948Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:06.2888264Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.2888560Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:06.2888871Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:06.2889200Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:06.2889451Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:06.2889707Z #define __ELF__ 1 2025-05-07T20:25:06.2889940Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:06.2890223Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:06.2890490Z #define __FLT_RADIX__ 2 2025-05-07T20:25:06.2890742Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:06.2891104Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:06.2891567Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:06.2891826Z #define __SSE_MATH__ 1 2025-05-07T20:25:06.2892051Z #define __k8 1 2025-05-07T20:25:06.2892341Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:06.2892828Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:06.2893124Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:06.2893445Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:06.2893726Z #define __LDBL_DIG__ 18 2025-05-07T20:25:06.2893968Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:06.2894222Z #define __x86_64__ 1 2025-05-07T20:25:06.2894452Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:06.2894749Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:06.2895083Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.2895381Z #define __FLT64_DIG__ 15 2025-05-07T20:25:06.2895670Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.2896030Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:06.2896341Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.2896610Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:06.2896887Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.2897179Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:06.2897554Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 
2025-05-07T20:25:06.2897950Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:06.2898364Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:06.2898699Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:06.2899021Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:06.2899318Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:06.2899595Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:06.2899906Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:06.2900186Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:06.2900421Z #define __SEG_FS 1 2025-05-07T20:25:06.2900652Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:06.2900937Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:06.2901205Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.2901492Z #define __SEG_GS 1 2025-05-07T20:25:06.2901801Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:25:06.2902186Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:06.2902453Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:06.2902743Z #define __INT16_TYPE__ short int 2025-05-07T20:25:06.2903022Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:06.2903309Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:06.2903574Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:06.2903821Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:06.2904074Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:06.2904414Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:06.2904798Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.2905080Z #define linux 1 2025-05-07T20:25:06.2905307Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.2905585Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:06.2905850Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:06.2906103Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:06.2906362Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:06.2906625Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:06.2906972Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:06.2907382Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:06.2907717Z #define __code_model_small__ 1 2025-05-07T20:25:06.2907991Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:06.2908281Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:06.2908527Z #define __k8__ 1 2025-05-07T20:25:06.2908748Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:06.2909036Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:06.2909340Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:06.2909575Z #define __pic__ 2 2025-05-07T20:25:06.2909927Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.2910242Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:06.2910531Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.2910865Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:06.2911235Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:06.2911667Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:06.2911933Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:06.2912227Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:06.2912540Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:06.2912785Z #define __linux__ 1 2025-05-07T20:25:06.2913016Z #define __INT64_TYPE__ long int 2025-05-07T20:25:06.2913303Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:06.2913584Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:06.2913858Z 
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:06.2914119Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:06.2914409Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.2914752Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:06.2915049Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:06.2915314Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:06.2915627Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:06.2915930Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:06.2916256Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:06.2916614Z #define __SSE__ 1 2025-05-07T20:25:06.2916850Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:06.2917181Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:06.2917521Z #define __amd64__ 1 2025-05-07T20:25:06.2917745Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:06.2917992Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:06.2918268Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:06.2918540Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:06.2918807Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:06.2919082Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:06.2919343Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:06.2919618Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:06.2919878Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:06.2920229Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:06.2920699Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:06.2921042Z #define _LP64 1 2025-05-07T20:25:06.2921255Z #define __UINT8_C(c) c 2025-05-07T20:25:06.2921494Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:06.2921752Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:06.2922021Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:06.2922293Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:06.2922587Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:06.2922941Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:06.2923432Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:06.2923833Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.2924124Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.2924437Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:06.2924803Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:06.2925170Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:06.2925434Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:06.2925773Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:06.2926131Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:06.2926395Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:06.2926644Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:06.2926892Z #define __FXSR__ 1 2025-05-07T20:25:06.2927191Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:06.2927639Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:06.2928048Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:06.2928449Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:06.2928707Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:06.2929042Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:06.2929397Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:06.2929717Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:06.2929954Z #define __PIC__ 2 2025-05-07T20:25:06.2930203Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:06.2930605Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:06.2930990Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:06.2931326Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:06.2931646Z #define __SSE2__ 1 2025-05-07T20:25:06.2931871Z #define __INT32_TYPE__ int 2025-05-07T20:25:06.2932128Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:06.2932384Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:06.2932721Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:06.2933082Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:06.2933348Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:06.2933619Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:06.2933889Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.2934164Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:06.2934416Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:06.2934667Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:06.2934962Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.2935255Z #define __PIE__ 2 2025-05-07T20:25:06.2935579Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:06.2935970Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:06.2936306Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:06.2936668Z #define __INT16_C(c) c 2025-05-07T20:25:06.2936892Z #define __STDC__ 1 2025-05-07T20:25:06.2937117Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:06.2937391Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:06.2937658Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:06.2937953Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:06.2938407Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:06.2938739Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:06.2939010Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:06.2939285Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:06.2939550Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:06.2939835Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:06.2940123Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.2940394Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:06.2940692Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.2941082Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:06.2941456Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:06.2941762Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:06.2942054Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:06.2942310Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:06.2942473Z 2025-05-07T20:25:06.3462444Z 2025-05-07T20:25:06.3463266Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
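The C++ dump that follows differs from the C dump above chiefly in the language-mode macros: `__cplusplus 201703L`, the `__cpp_*` feature-test macros, `__GNUG__`, `__EXCEPTIONS`, and `__GXX_RTTI`. One quick way to see exactly what the C++ front end adds (a generic sketch, not part of this job):

  # Diff the predefined macros of the C and C++ front ends.
  cc  -dM -E -x c   - </dev/null | sort > /tmp/c_defines.txt
  c++ -dM -E -x c++ - </dev/null | sort > /tmp/cxx_defines.txt
  # Lines only in the C++ dump include __cplusplus and the __cpp_* macros.
  diff /tmp/c_defines.txt /tmp/cxx_defines.txt | grep '^>'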
2025-05-07T20:25:06.3463791Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:06.3464051Z 2025-05-07T20:25:08.2259820Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:08.2260185Z #define __cpp_attributes 200809L 2025-05-07T20:25:08.2260553Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:08.2261045Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:08.2261423Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:08.2261799Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:08.2262264Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:08.2262743Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:08.2263133Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:08.2263559Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:08.2264362Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:08.2264775Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:08.2265152Z #define __CHAR_BIT__ 8 2025-05-07T20:25:08.2265424Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:08.2265747Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:08.2266392Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:08.2266800Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:08.2267085Z #define __cpp_static_assert 201411L 2025-05-07T20:25:08.2267390Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:08.2267703Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.2268011Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:08.2268316Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:08.2268654Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:08.2268988Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:08.2269395Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:08.2269821Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:08.2270142Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:08.2270425Z #define __GCC_IEC_559 2 2025-05-07T20:25:08.2270678Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:08.2270961Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:08.2271245Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:08.2271544Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:08.2271845Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:08.2272168Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:08.2272489Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:08.2272830Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.2273160Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:08.2273434Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:08.2273722Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:08.2274009Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:08.2274311Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:08.2274590Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:08.2274863Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:08.2275144Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:08.2275488Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:08.2275829Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:08.2276083Z #define __INT8_C(c) c 2025-05-07T20:25:08.2276331Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:08.2276614Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:08.2276938Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.2277271Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:08.2277555Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:08.2277853Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:08.2278171Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:08.2278530Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:08.2278823Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:08.2279110Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:08.2279386Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:08.2279679Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:08.2279963Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:08.2280368Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:08.2280793Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:08.2281084Z #define __linux 1 2025-05-07T20:25:08.2281325Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:08.2281614Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:08.2281903Z #define __unix 1 2025-05-07T20:25:08.2282133Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:08.2282433Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:08.2282736Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:08.2283009Z #define __WINT_MIN__ 0U 2025-05-07T20:25:08.2283265Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:08.2283564Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:08.2284007Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:08.2284287Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:08.2284551Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:08.2284835Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:08.2285142Z #define __INT64_C(c) c ## L 2025-05-07T20:25:08.2285501Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:08.2285800Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:08.2286085Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:08.2286396Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:08.2286681Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:08.2286948Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:08.2287306Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:08.2287706Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:08.2287971Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:08.2288251Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:08.2288535Z #define __DBL_DIG__ 15 2025-05-07T20:25:08.2288784Z #define __FLT32_DIG__ 6 2025-05-07T20:25:08.2289093Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:08.2289442Z #define __GXX_WEAK__ 1 2025-05-07T20:25:08.2289687Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:08.2289940Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:08.2290281Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:08.2290635Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:08.2299125Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:08.2299470Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:08.2299811Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:08.2300224Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:08.2300627Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:08.2300905Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:08.2301167Z #define __unix__ 1 2025-05-07T20:25:08.2301397Z #define __INT_WIDTH__ 32 2025-05-07T20:25:08.2301645Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:08.2301898Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:08.2302157Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:08.2302422Z #define __UINT16_C(c) c 2025-05-07T20:25:08.2302662Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:08.2302926Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:08.2303279Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:08.2303643Z #define __gnu_linux__ 1 2025-05-07T20:25:08.2303885Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:08.2304144Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:08.2304429Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:08.2304717Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:08.2304988Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:08.2305245Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:08.2305498Z #define __GNUC__ 11 2025-05-07T20:25:08.2305718Z #define __GXX_RTTI 1 2025-05-07T20:25:08.2305936Z #define __pie__ 2 2025-05-07T20:25:08.2306157Z #define __MMX__ 1 2025-05-07T20:25:08.2306382Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:08.2306643Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:08.2306932Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:08.2307199Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:08.2307447Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:08.2307748Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:08.2308069Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:08.2308411Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:08.2308783Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:08.2309094Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.2309404Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:08.2309667Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:08.2309939Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:08.2310249Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:08.2310541Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:08.2311036Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:08.2311305Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:08.2311587Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:08.2311885Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:08.2312157Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:08.2312533Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:08.2312793Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:08.2313060Z #define __cplusplus 201703L 2025-05-07T20:25:08.2313326Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:08.2313620Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:08.2313928Z #define __DEPRECATED 1 2025-05-07T20:25:08.2314184Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:08.2314487Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:08.2314748Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:08.2315067Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:08.2315424Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:08.2315705Z #define __SSE2_MATH__ 1 2025-05-07T20:25:08.2315960Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:08.2316257Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.2316551Z #define __amd64 1 2025-05-07T20:25:08.2316775Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:08.2317046Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:08.2317313Z #define __GNUG__ 11 2025-05-07T20:25:08.2317571Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:08.2317877Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:08.2318133Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:08.2318393Z #define __FLT64X_MIN_EXP__ (-16381) 
2025-05-07T20:25:08.2318663Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:08.2318921Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:08.2319200Z #define __cpp_initializer_lists 200806L 2025-05-07T20:25:08.2319497Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:08.2319756Z #define __cpp_hex_float 201603L 2025-05-07T20:25:08.2320030Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:08.2320308Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:08.2320578Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:08.2320847Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:08.2321116Z #define __x86_64 1 2025-05-07T20:25:08.2321338Z #define __cpp_lambdas 200907L 2025-05-07T20:25:08.2321613Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:08.2321986Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:08.2322368Z #define __cpp_template_auto 201606L 2025-05-07T20:25:08.2322726Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:08.2323176Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:08.2323644Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:08.2324062Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:08.2324323Z #define __LP64__ 1 2025-05-07T20:25:08.2324553Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:08.2324903Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:08.2325282Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:08.2325562Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:08.2325838Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:08.2326120Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:08.2326395Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:08.2326649Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:08.2326915Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:08.2327247Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:08.2327606Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:08.2327881Z #define __FLT_DIG__ 6 2025-05-07T20:25:08.2328118Z #define __NO_INLINE__ 1 2025-05-07T20:25:08.2328361Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:25:08.2328683Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:08.2329033Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:08.2329297Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:08.2329713Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:08.2329975Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:08.2330257Z #define __cpp_unicode_characters 201411L 2025-05-07T20:25:08.2330551Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:08.2330809Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:08.2331233Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:08.2331517Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:08.2331789Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:08.2332095Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:08.2332438Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:25:08.2332724Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:08.2332988Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:08.2333245Z #define __FLT128_DIG__ 33 2025-05-07T20:25:08.2333481Z #define __INT32_C(c) c 2025-05-07T20:25:08.2333728Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:08.2334016Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:08.2334372Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:08.2334656Z #define 
__INT_FAST32_TYPE__ long int 2025-05-07T20:25:08.2334968Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:08.2335274Z #define unix 1 2025-05-07T20:25:08.2335496Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:08.2335757Z #define __cpp_rtti 199711L 2025-05-07T20:25:08.2336025Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:08.2336338Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.2336641Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:08.2336953Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:08.2337285Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:08.2337533Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:08.2337823Z #define __cpp_digit_separators 201309L 2025-05-07T20:25:08.2338260Z #define __ELF__ 1 2025-05-07T20:25:08.2338537Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:08.2338818Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:08.2339101Z #define __FLT_RADIX__ 2 2025-05-07T20:25:08.2339356Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:08.2339711Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:08.2340076Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:08.2340352Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:25:08.2340629Z #define __k8 1 2025-05-07T20:25:08.2340929Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:08.2341303Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:08.2341597Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:08.2341901Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:08.2342168Z #define __LDBL_DIG__ 18 2025-05-07T20:25:08.2342407Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:08.2342666Z #define __x86_64__ 1 2025-05-07T20:25:08.2342906Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:08.2343209Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:08.2343551Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.2343906Z #define __FLT64_DIG__ 15 2025-05-07T20:25:08.2344195Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.2344541Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:08.2344859Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:08.2345130Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:08.2345402Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.2345703Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:08.2346070Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:25:08.2346461Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:08.2346755Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:08.2347079Z #define __cpp_unicode_literals 200710L 2025-05-07T20:25:08.2347400Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:08.2347718Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:08.2348014Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:08.2348407Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:08.2348713Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:08.2348997Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:08.2349242Z #define __SEG_FS 1 2025-05-07T20:25:08.2349469Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:08.2349832Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:08.2350110Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.2350392Z #define __SEG_GS 1 2025-05-07T20:25:08.2350704Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 
2025-05-07T20:25:08.2351087Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:08.2351365Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:08.2351650Z #define __INT16_TYPE__ short int 2025-05-07T20:25:08.2351946Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:08.2352261Z #define __cpp_structured_bindings 201606L 2025-05-07T20:25:08.2352562Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:08.2352810Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:08.2353081Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:08.2353425Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:08.2353817Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.2354180Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:25:08.2354518Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:25:08.2354823Z #define linux 1 2025-05-07T20:25:08.2355047Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:08.2355326Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:08.2356037Z #define __EXCEPTIONS 1 2025-05-07T20:25:08.2356387Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:08.2356748Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:08.2357097Z #define __cpp_range_based_for 201603L 2025-05-07T20:25:08.2357388Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:08.2357738Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:08.2358123Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:25:08.2358474Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:08.2358804Z #define __code_model_small__ 1 2025-05-07T20:25:08.2359078Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:08.2359392Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:25:08.2359693Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:08.2359971Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:25:08.2360261Z #define __k8__ 1 2025-05-07T20:25:08.2360485Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:08.2360772Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:08.2361071Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:08.2361309Z #define __pic__ 2 2025-05-07T20:25:08.2361561Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.2361877Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:08.2362146Z #define __cpp_decltype 200707L 2025-05-07T20:25:08.2362441Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.2362773Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:08.2363141Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:08.2363499Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:08.2363838Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:08.2364168Z #define __cpp_inline_variables 201606L 2025-05-07T20:25:08.2364458Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:08.2364713Z #define __linux__ 1 2025-05-07T20:25:08.2364944Z #define __INT64_TYPE__ long int 2025-05-07T20:25:08.2365202Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:08.2365467Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:08.2365742Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:25:08.2366026Z #define __cpp_inheriting_constructors 201511L 2025-05-07T20:25:08.2366345Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:08.2366644Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.2366954Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:08.2367225Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:08.2367777Z #define __UINT_LEAST32_TYPE__ unsigned 
int 2025-05-07T20:25:08.2368084Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:08.2368408Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:08.2368765Z #define __SSE__ 1 2025-05-07T20:25:08.2368998Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:08.2369471Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:08.2369810Z #define __amd64__ 1 2025-05-07T20:25:08.2370039Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:08.2370289Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:08.2370565Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:08.2370833Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:08.2371103Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:08.2371363Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:08.2371642Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:08.2371902Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:08.2372248Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:08.2372714Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:08.2373064Z #define _LP64 1 2025-05-07T20:25:08.2373274Z #define __UINT8_C(c) c 2025-05-07T20:25:08.2373520Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:08.2373811Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:08.2374105Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:08.2374369Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:08.2374725Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:08.2375176Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:08.2375551Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:08.2375849Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.2376154Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:08.2376465Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:25:08.2376847Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:08.2377217Z #define __STDCPP_THREADS__ 1 2025-05-07T20:25:08.2377477Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:08.2377742Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:08.2378168Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:08.2378531Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:08.2378792Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:08.2379042Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:08.2379287Z #define __FXSR__ 1 2025-05-07T20:25:08.2379590Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:08.2380037Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:08.2380443Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:08.2380746Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:08.2381016Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:25:08.2381317Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:08.2381606Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:08.2381883Z #define __cpp_alias_templates 200704L 2025-05-07T20:25:08.2382246Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:08.2382606Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:08.2382877Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:08.2383130Z #define __LONG_WIDTH__ 64 2025-05-07T20:25:08.2383367Z #define __PIC__ 2 2025-05-07T20:25:08.2383624Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:08.2384057Z #define 
__FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:08.2384467Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:08.2384801Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:08.2385148Z #define __cpp_constexpr 201603L 2025-05-07T20:25:08.2385413Z #define __SSE2__ 1 2025-05-07T20:25:08.2385646Z #define __cpp_deduction_guides 201703L 2025-05-07T20:25:08.2385939Z #define __INT32_TYPE__ int 2025-05-07T20:25:08.2386194Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:08.2386555Z #define __cpp_exceptions 199711L 2025-05-07T20:25:08.2386836Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:08.2387171Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:08.2387523Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:08.2387950Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:08.2388223Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:08.2388485Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:08.2388761Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:08.2389012Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:08.2389273Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:25:08.2389560Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:08.2389853Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.2390150Z #define __PIE__ 2 2025-05-07T20:25:08.2390467Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:08.2390880Z #define __cpp_template_template_args 201611L 2025-05-07T20:25:08.2391196Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:08.2391536Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:08.2391900Z #define __INT16_C(c) c 2025-05-07T20:25:08.2392127Z #define __STDC__ 1 2025-05-07T20:25:08.2392343Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:08.2392609Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:08.2392884Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:08.2393142Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:08.2393436Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:08.2393819Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:08.2394162Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:08.2394425Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:08.2394719Z #define __cpp_generic_lambdas 201304L 2025-05-07T20:25:08.2395002Z #define __SSE_MATH__ 1 2025-05-07T20:25:08.2395239Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:08.2395528Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:25:08.2395843Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:08.2396125Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:08.2396420Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:08.2396696Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:08.2396993Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.2397395Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:08.2397769Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:08.2398079Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:08.2398367Z #define _GNU_SOURCE 1 2025-05-07T20:25:08.2398616Z #define __cpp_init_captures 201304L 2025-05-07T20:25:08.2398899Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:08.2399147Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:08.2399318Z 2025-05-07T20:25:08.2901208Z 2025-05-07T20:25:08.2901944Z + conda run -n build_binary c++ --version 2025-05-07T20:25:08.2902202Z 2025-05-07T20:25:10.1615770Z c++ 
(conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:25:10.1616192Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:25:10.1616645Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:25:10.1617184Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:25:10.1617523Z 2025-05-07T20:25:10.1617527Z 2025-05-07T20:25:10.2242804Z 2025-05-07T20:25:10.2243665Z [INFO] Printing the default version of the C standard used by the compiler ... 2025-05-07T20:25:10.2244239Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__ 2025-05-07T20:25:10.2244592Z 2025-05-07T20:25:12.1665937Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:12.1668401Z 2025-05-07T20:25:12.1668869Z [INFO] Printing the default version of the C++ standard used by the compiler ... 2025-05-07T20:25:12.1669435Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus 2025-05-07T20:25:12.1669749Z 2025-05-07T20:25:14.1113662Z #define __cplusplus 201703L 2025-05-07T20:25:14.1115851Z 2025-05-07T20:25:14.1116805Z [INSTALL] Successfully installed C/C++ compilers 2025-05-07T20:25:14.1166296Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.6.3 2025-05-07T20:25:14.1166722Z . $PRELUDE; install_cuda $BUILD_ENV 12.6.3 2025-05-07T20:25:14.1179768Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:25:14.1180120Z env: 2025-05-07T20:25:14.1180356Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:25:14.1180654Z BUILD_ENV: build_binary 2025-05-07T20:25:14.1180903Z BUILD_TARGET: genai 2025-05-07T20:25:14.1181140Z BUILD_VARIANT: cuda 2025-05-07T20:25:14.1181373Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:25:14.1181633Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:25:14.1181940Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:25:14.1182269Z ##[endgroup] 2025-05-07T20:25:14.4525831Z ################################################################################ 2025-05-07T20:25:14.4526324Z # Install CUDA 2025-05-07T20:25:14.4526613Z # 2025-05-07T20:25:14.4540807Z # [2025-05-07T20:25:14.453Z] + install_cuda build_binary 12.6.3 2025-05-07T20:25:14.4541344Z ################################################################################ 2025-05-07T20:25:14.4541641Z 2025-05-07T20:25:14.4556121Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:25:14.5413473Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:25:14.5413978Z [SETUP] Cleaning up Conda packages ... 2025-05-07T20:25:14.5417939Z + conda clean --packages --tarball -y 2025-05-07T20:25:14.5418315Z 2025-05-07T20:25:15.2504669Z Will remove 32 (142.2 MB) tarball(s). 2025-05-07T20:25:15.2505340Z Will remove 6 (617 KB) package(s). 2025-05-07T20:25:15.3123207Z 2025-05-07T20:25:15.3131447Z + conda clean --all -y 2025-05-07T20:25:15.3131652Z 2025-05-07T20:25:15.9839713Z There are no unused tarball(s) to remove. 2025-05-07T20:25:15.9840550Z Will remove 1 index cache(s). 2025-05-07T20:25:15.9841200Z There are no unused package(s) to remove. 2025-05-07T20:25:15.9841849Z There are no tempfile(s) to remove. 2025-05-07T20:25:15.9842456Z There are no logfile(s) to remove. 2025-05-07T20:25:16.0462557Z 2025-05-07T20:25:16.0476599Z [INSTALL] Installing CUDA 12.6.3 ... 
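[note] The standard-version probes above read compiler-predefined macros rather than parsing --version output. A minimal standalone sketch of the same checks, assuming a GCC-compatible cc/c++ on PATH (the job runs them through conda run -n build_binary):

    cc  -dM -E -        < /dev/null | grep __STDC_VERSION__   # 201710L -> C17 is the default
    c++ -dM -E -x c++ - < /dev/null | grep __cplusplus        # 201703L -> C++17 is the default

The same -dM -E invocation without the grep produces the full predefined-macro dump printed earlier in this log.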
2025-05-07T20:25:16.0500513Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.6.3 2025-05-07T20:25:16.9585733Z Channels: 2025-05-07T20:25:16.9586119Z - conda-forge 2025-05-07T20:25:16.9586443Z Platform: linux-64 2025-05-07T20:25:27.4431921Z Collecting package metadata (repodata.json): - \ | / - \ | / - \ | / - \ | / - \ done 2025-05-07T20:25:28.5172551Z Solving environment: / - \ | done 2025-05-07T20:25:28.5903816Z 2025-05-07T20:25:28.5904139Z ## Package Plan ## 2025-05-07T20:25:28.5904385Z 2025-05-07T20:25:28.5904680Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:25:28.5905013Z 2025-05-07T20:25:28.5905117Z added / updated specs: 2025-05-07T20:25:28.5905359Z - cuda=12.6.3 2025-05-07T20:25:28.5905500Z 2025-05-07T20:25:28.5905530Z 2025-05-07T20:25:28.5905655Z The following packages will be downloaded: 2025-05-07T20:25:28.5905875Z 2025-05-07T20:25:28.5905998Z package | build 2025-05-07T20:25:28.5906472Z ---------------------------|----------------- 2025-05-07T20:25:28.5906938Z alsa-lib-1.2.14 | hb9d3cd8_0 553 KB conda-forge 2025-05-07T20:25:28.5907519Z attr-2.5.1 | h166bdaf_1 69 KB conda-forge 2025-05-07T20:25:28.5908154Z binutils-2.40 | h4852527_7 31 KB conda-forge 2025-05-07T20:25:28.5908739Z c-compiler-1.5.2 | h0b41bf4_0 6 KB conda-forge 2025-05-07T20:25:28.5909259Z cuda-12.6.3 | ha804496_0 26 KB conda-forge 2025-05-07T20:25:28.5909694Z cuda-cccl_linux-64-12.6.77 | ha770c72_0 1.0 MB conda-forge 2025-05-07T20:25:28.5910200Z cuda-command-line-tools-12.6.3| ha770c72_0 20 KB conda-forge 2025-05-07T20:25:28.5911679Z cuda-compiler-12.6.3 | hbad6d8a_0 20 KB conda-forge 2025-05-07T20:25:28.5912180Z cuda-crt-dev_linux-64-12.6.85| ha770c72_0 87 KB conda-forge 2025-05-07T20:25:28.5912808Z cuda-crt-tools-12.6.85 | ha770c72_0 26 KB conda-forge 2025-05-07T20:25:28.5913258Z cuda-cudart-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:25:28.5913737Z cuda-cudart-dev-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:25:28.5914234Z cuda-cudart-dev_linux-64-12.6.77| h3f2d84a_0 357 KB conda-forge 2025-05-07T20:25:28.5914740Z cuda-cudart-static-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:25:28.5915254Z cuda-cudart-static_linux-64-12.6.77| h3f2d84a_0 744 KB conda-forge 2025-05-07T20:25:28.5915775Z cuda-cudart_linux-64-12.6.77| h3f2d84a_0 184 KB conda-forge 2025-05-07T20:25:28.5916262Z cuda-cuobjdump-12.6.77 | hbd13f7d_1 241 KB conda-forge 2025-05-07T20:25:28.5916776Z cuda-cupti-12.6.80 | hbd13f7d_0 1.9 MB conda-forge 2025-05-07T20:25:28.5917228Z cuda-cupti-dev-12.6.80 | h5888daf_0 3.4 MB conda-forge 2025-05-07T20:25:28.5917693Z cuda-cuxxfilt-12.6.77 | hbd13f7d_1 211 KB conda-forge 2025-05-07T20:25:28.5918159Z cuda-driver-dev-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:25:28.5918645Z cuda-driver-dev_linux-64-12.6.77| h3f2d84a_0 35 KB conda-forge 2025-05-07T20:25:28.5919116Z cuda-gdb-12.6.77 | h50b4baa_1 370 KB conda-forge 2025-05-07T20:25:28.5919559Z cuda-libraries-12.6.3 | ha770c72_0 20 KB conda-forge 2025-05-07T20:25:28.5920038Z cuda-libraries-dev-12.6.3 | ha770c72_0 20 KB conda-forge 2025-05-07T20:25:28.5920508Z cuda-nsight-12.6.77 | h7938cbb_0 113.2 MB conda-forge 2025-05-07T20:25:28.5920951Z cuda-nvcc-12.6.85 | hcdd1206_0 23 KB conda-forge 2025-05-07T20:25:28.5921414Z cuda-nvcc-dev_linux-64-12.6.85| he91c749_0 10.8 MB conda-forge 2025-05-07T20:25:28.5921895Z cuda-nvcc-impl-12.6.85 | h85509e4_0 25 KB conda-forge 2025-05-07T20:25:28.5922357Z cuda-nvcc-tools-12.6.85 | he02047a_0 23.0 MB 
conda-forge 2025-05-07T20:25:28.5922825Z cuda-nvcc_linux-64-12.6.85 | h04802cd_0 25 KB conda-forge 2025-05-07T20:25:28.5923287Z cuda-nvdisasm-12.6.77 | hbd13f7d_1 47.6 MB conda-forge 2025-05-07T20:25:28.5923736Z cuda-nvml-dev-12.6.77 | hbd13f7d_1 159 KB conda-forge 2025-05-07T20:25:28.5924183Z cuda-nvprof-12.6.80 | hbd13f7d_0 2.6 MB conda-forge 2025-05-07T20:25:28.5924631Z cuda-nvprune-12.6.77 | hbd13f7d_1 66 KB conda-forge 2025-05-07T20:25:28.5925077Z cuda-nvrtc-12.6.85 | hbd13f7d_0 17.3 MB conda-forge 2025-05-07T20:25:28.5925516Z cuda-nvrtc-dev-12.6.85 | h5888daf_0 31 KB conda-forge 2025-05-07T20:25:28.5925964Z cuda-nvtx-12.6.77 | hbd13f7d_0 31 KB conda-forge 2025-05-07T20:25:28.5926422Z cuda-nvvm-dev_linux-64-12.6.85| ha770c72_0 25 KB conda-forge 2025-05-07T20:25:28.5926891Z cuda-nvvm-impl-12.6.85 | he02047a_0 7.7 MB conda-forge 2025-05-07T20:25:28.5927351Z cuda-nvvm-tools-12.6.85 | he02047a_0 10.4 MB conda-forge 2025-05-07T20:25:28.5927797Z cuda-nvvp-12.6.80 | hbd13f7d_1 109.3 MB conda-forge 2025-05-07T20:25:28.5928237Z cuda-opencl-12.6.77 | hbd13f7d_0 29 KB conda-forge 2025-05-07T20:25:28.5928693Z cuda-opencl-dev-12.6.77 | h5888daf_0 93 KB conda-forge 2025-05-07T20:25:28.5929219Z cuda-profiler-api-12.6.77 | h7938cbb_0 22 KB conda-forge 2025-05-07T20:25:28.5929816Z cuda-runtime-12.6.3 | ha804496_0 19 KB conda-forge 2025-05-07T20:25:28.5930286Z cuda-sanitizer-api-12.6.77 | hbd13f7d_1 8.9 MB conda-forge 2025-05-07T20:25:28.5930840Z cuda-toolkit-12.6.3 | ha804496_0 19 KB conda-forge 2025-05-07T20:25:28.5931277Z cuda-tools-12.6.3 | ha770c72_0 19 KB conda-forge 2025-05-07T20:25:28.5931709Z cuda-version-12.6 | h7480c83_3 20 KB conda-forge 2025-05-07T20:25:28.5932163Z cuda-visual-tools-12.6.3 | ha770c72_0 19 KB conda-forge 2025-05-07T20:25:28.5932628Z cxx-compiler-1.5.2 | hf52228f_0 6 KB conda-forge 2025-05-07T20:25:28.5933061Z dbus-1.13.6 | h5008d03_3 604 KB conda-forge 2025-05-07T20:25:28.5933444Z expat-2.7.0 | h5888daf_0 137 KB conda-forge 2025-05-07T20:25:28.5933915Z font-ttf-dejavu-sans-mono-2.37| hab24e00_0 388 KB conda-forge 2025-05-07T20:25:28.5934433Z font-ttf-inconsolata-3.000 | h77eed37_0 94 KB conda-forge 2025-05-07T20:25:28.5934951Z font-ttf-source-code-pro-2.038| h77eed37_0 684 KB conda-forge 2025-05-07T20:25:28.5935448Z font-ttf-ubuntu-0.83 | h77eed37_3 1.5 MB conda-forge 2025-05-07T20:25:28.5935901Z fontconfig-2.15.0 | h7e30c49_1 259 KB conda-forge 2025-05-07T20:25:28.5936366Z fonts-conda-ecosystem-1 | 0 4 KB conda-forge 2025-05-07T20:25:28.5936834Z fonts-conda-forge-1 | 0 4 KB conda-forge 2025-05-07T20:25:28.5937276Z freetype-2.13.3 | ha770c72_1 168 KB conda-forge 2025-05-07T20:25:28.5937677Z gcc-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:25:28.5938218Z gds-tools-1.11.1.6 | h5888daf_4 37.8 MB conda-forge 2025-05-07T20:25:28.5938620Z gmp-6.3.0 | hac33072_2 449 KB conda-forge 2025-05-07T20:25:28.5939000Z gxx-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:25:28.5939402Z keyutils-1.6.1 | h166bdaf_0 115 KB conda-forge 2025-05-07T20:25:28.5939797Z krb5-1.21.3 | h659f571_0 1.3 MB conda-forge 2025-05-07T20:25:28.5940187Z libcap-2.71 | h39aace5_0 100 KB conda-forge 2025-05-07T20:25:28.5940605Z libcublas-12.6.4.1 | h5888daf_1 256.2 MB conda-forge 2025-05-07T20:25:28.5941054Z libcublas-dev-12.6.4.1 | h5888daf_1 88 KB conda-forge 2025-05-07T20:25:28.5941493Z libcufft-11.3.0.4 | hbd13f7d_0 156.2 MB conda-forge 2025-05-07T20:25:28.5941932Z libcufft-dev-11.3.0.4 | h5888daf_0 33 KB conda-forge 2025-05-07T20:25:28.5942378Z libcufile-1.11.1.6 | h12f29b5_4 900 KB conda-forge 
2025-05-07T20:25:28.5942823Z libcufile-dev-1.11.1.6 | h5888daf_4 35 KB conda-forge 2025-05-07T20:25:28.5943268Z libcurand-10.3.7.77 | hbd13f7d_0 39.9 MB conda-forge 2025-05-07T20:25:28.5943722Z libcurand-dev-10.3.7.77 | h5888daf_0 262 KB conda-forge 2025-05-07T20:25:28.5944178Z libcusolver-11.7.1.2 | h5888daf_1 95.8 MB conda-forge 2025-05-07T20:25:28.5944637Z libcusolver-dev-11.7.1.2 | h5888daf_1 59 KB conda-forge 2025-05-07T20:25:28.5945101Z libcusparse-12.5.4.2 | hbd13f7d_0 118.6 MB conda-forge 2025-05-07T20:25:28.5945566Z libcusparse-dev-12.5.4.2 | h5888daf_0 51 KB conda-forge 2025-05-07T20:25:28.5946033Z libedit-3.1.20191231 | he28a2e2_2 121 KB conda-forge 2025-05-07T20:25:28.5946465Z libexpat-2.7.0 | h5888daf_0 73 KB conda-forge 2025-05-07T20:25:28.5947002Z libfreetype-2.13.3 | ha770c72_1 8 KB conda-forge 2025-05-07T20:25:28.5947453Z libfreetype6-2.13.3 | h48d6fc4_1 371 KB conda-forge 2025-05-07T20:25:28.5947974Z libgcrypt-lib-1.11.0 | hb9d3cd8_2 572 KB conda-forge 2025-05-07T20:25:28.5948411Z libglib-2.84.0 | h2ff4ddf_0 3.8 MB conda-forge 2025-05-07T20:25:28.5948858Z libgpg-error-1.55 | h3f2d84a_0 305 KB conda-forge 2025-05-07T20:25:28.5949321Z libiconv-1.18 | h4ce23a2_1 696 KB conda-forge 2025-05-07T20:25:28.5949720Z libnl-3.11.0 | hb9d3cd8_0 724 KB conda-forge 2025-05-07T20:25:28.5950126Z libnpp-12.3.1.54 | h5888daf_0 93.4 MB conda-forge 2025-05-07T20:25:28.5950554Z libnpp-dev-12.3.1.54 | h5888daf_0 441 KB conda-forge 2025-05-07T20:25:28.5950976Z libnsl-2.0.1 | hd590300_0 33 KB conda-forge 2025-05-07T20:25:28.5951382Z libnuma-2.0.18 | h4ab18f5_2 42 KB conda-forge 2025-05-07T20:25:28.5951811Z libnvfatbin-12.6.77 | hbd13f7d_0 783 KB conda-forge 2025-05-07T20:25:28.5952281Z libnvfatbin-dev-12.6.77 | h5888daf_0 26 KB conda-forge 2025-05-07T20:25:28.5952742Z libnvjitlink-12.6.85 | hbd13f7d_0 14.9 MB conda-forge 2025-05-07T20:25:28.5953212Z libnvjitlink-dev-12.6.85 | h5888daf_0 25 KB conda-forge 2025-05-07T20:25:28.5953675Z libnvjpeg-12.3.3.54 | h5888daf_0 2.4 MB conda-forge 2025-05-07T20:25:28.5954131Z libnvjpeg-dev-12.3.3.54 | ha770c72_0 31 KB conda-forge 2025-05-07T20:25:28.5954559Z libpng-1.6.47 | h943b412_0 282 KB conda-forge 2025-05-07T20:25:28.5954977Z libsqlite-3.49.2 | hee588c1_0 895 KB conda-forge 2025-05-07T20:25:28.5955417Z libsystemd0-256.9 | h2774228_0 401 KB conda-forge 2025-05-07T20:25:28.5956301Z libudev1-257.4 | h9a4d06a_0 140 KB conda-forge 2025-05-07T20:25:28.5956732Z libuuid-2.38.1 | h0b41bf4_0 33 KB conda-forge 2025-05-07T20:25:28.5957138Z libxcb-1.17.0 | h8a09558_0 387 KB conda-forge 2025-05-07T20:25:28.5957567Z libxkbcommon-1.8.0 | hc4a0caf_0 627 KB conda-forge 2025-05-07T20:25:28.5958007Z libxkbfile-1.1.0 | h166bdaf_1 111 KB conda-forge 2025-05-07T20:25:28.5958429Z libxml2-2.13.5 | h064dc61_0 673 KB conda-forge 2025-05-07T20:25:28.5958842Z libzlib-1.3.1 | hb9d3cd8_2 60 KB conda-forge 2025-05-07T20:25:28.5959268Z lz4-c-1.9.4 | hcb278e6_0 140 KB conda-forge 2025-05-07T20:25:28.5959731Z nsight-compute-2024.3.2.3 | hb5ebaad_0 443.1 MB conda-forge 2025-05-07T20:25:28.5960174Z nspr-4.36 | h5888daf_0 225 KB conda-forge 2025-05-07T20:25:28.5960556Z nss-3.111 | h159eef7_0 1.9 MB conda-forge 2025-05-07T20:25:28.5960949Z ocl-icd-2.3.3 | hb9d3cd8_0 104 KB conda-forge 2025-05-07T20:25:28.5961398Z opencl-headers-2024.10.24 | h5888daf_0 53 KB conda-forge 2025-05-07T20:25:28.5961839Z pcre2-10.44 | hc749103_2 934 KB conda-forge 2025-05-07T20:25:28.5962262Z pthread-stubs-0.4 | hb9d3cd8_1002 8 KB conda-forge 2025-05-07T20:25:28.5962717Z python-3.10.13 |hd12c33a_1_cpython 24.5 MB 
conda-forge 2025-05-07T20:25:28.5963150Z rdma-core-55.0 | h5888daf_0 1.2 MB conda-forge 2025-05-07T20:25:28.5963559Z sqlite-3.32.3 | hcee41ef_1 1.4 MB conda-forge 2025-05-07T20:25:28.5964118Z tk-8.6.13 |noxft_h4845f30_101 3.2 MB conda-forge 2025-05-07T20:25:28.5964529Z wayland-1.23.1 | h3e06ad9_0 314 KB conda-forge 2025-05-07T20:25:28.5964938Z xcb-util-0.4.1 | hb711507_2 19 KB conda-forge 2025-05-07T20:25:28.5965516Z xcb-util-cursor-0.1.5 | hb9d3cd8_0 20 KB conda-forge 2025-05-07T20:25:28.5965971Z xcb-util-image-0.4.0 | hb711507_2 24 KB conda-forge 2025-05-07T20:25:28.5966430Z xcb-util-keysyms-0.4.1 | hb711507_0 14 KB conda-forge 2025-05-07T20:25:28.5966914Z xcb-util-renderutil-0.3.10 | hb711507_0 17 KB conda-forge 2025-05-07T20:25:28.5967369Z xcb-util-wm-0.4.2 | hb711507_0 50 KB conda-forge 2025-05-07T20:25:28.5967823Z xkeyboard-config-2.44 | hb9d3cd8_0 384 KB conda-forge 2025-05-07T20:25:28.5968281Z xorg-libice-1.1.2 | hb9d3cd8_0 57 KB conda-forge 2025-05-07T20:25:28.5968717Z xorg-libsm-1.2.6 | he73a12e_0 27 KB conda-forge 2025-05-07T20:25:28.5969142Z xorg-libx11-1.8.12 | h4f16b4b_0 816 KB conda-forge 2025-05-07T20:25:28.5969589Z xorg-libxau-1.0.12 | hb9d3cd8_0 14 KB conda-forge 2025-05-07T20:25:28.5970062Z xorg-libxcomposite-0.4.6 | hb9d3cd8_2 13 KB conda-forge 2025-05-07T20:25:28.5970539Z xorg-libxdamage-1.1.6 | hb9d3cd8_0 13 KB conda-forge 2025-05-07T20:25:28.5971000Z xorg-libxdmcp-1.1.5 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:25:28.5971450Z xorg-libxext-1.3.6 | hb9d3cd8_0 49 KB conda-forge 2025-05-07T20:25:28.5971903Z xorg-libxfixes-6.0.1 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:25:28.5972346Z xorg-libxi-1.8.2 | hb9d3cd8_0 46 KB conda-forge 2025-05-07T20:25:28.5972799Z xorg-libxrandr-1.5.4 | hb9d3cd8_0 29 KB conda-forge 2025-05-07T20:25:28.5973268Z xorg-libxrender-0.9.12 | hb9d3cd8_0 32 KB conda-forge 2025-05-07T20:25:28.5973722Z xorg-libxtst-1.2.5 | hb9d3cd8_3 32 KB conda-forge 2025-05-07T20:25:28.5974145Z zlib-1.3.1 | hb9d3cd8_2 90 KB conda-forge 2025-05-07T20:25:28.5974529Z zstd-1.5.7 | hb8e6e7a_2 554 KB conda-forge 2025-05-07T20:25:28.5974906Z ------------------------------------------------------------ 2025-05-07T20:25:28.5975241Z Total: 1.63 GB 2025-05-07T20:25:28.5975456Z 2025-05-07T20:25:28.5975590Z The following NEW packages will be INSTALLED: 2025-05-07T20:25:28.5975814Z 2025-05-07T20:25:28.5976020Z alsa-lib conda-forge/linux-64::alsa-lib-1.2.14-hb9d3cd8_0 2025-05-07T20:25:28.5976445Z attr conda-forge/linux-64::attr-2.5.1-h166bdaf_1 2025-05-07T20:25:28.5976868Z binutils conda-forge/linux-64::binutils-2.40-h4852527_7 2025-05-07T20:25:28.5977334Z c-compiler conda-forge/linux-64::c-compiler-1.5.2-h0b41bf4_0 2025-05-07T20:25:28.5977773Z cuda conda-forge/noarch::cuda-12.6.3-ha804496_0 2025-05-07T20:25:28.5978321Z cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.6.77-ha770c72_0 2025-05-07T20:25:28.5978950Z cuda-command-line~ conda-forge/linux-64::cuda-command-line-tools-12.6.3-ha770c72_0 2025-05-07T20:25:28.5979560Z cuda-compiler conda-forge/noarch::cuda-compiler-12.6.3-hbad6d8a_0 2025-05-07T20:25:28.5980109Z cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.6.85-ha770c72_0 2025-05-07T20:25:28.5980667Z cuda-crt-tools conda-forge/linux-64::cuda-crt-tools-12.6.85-ha770c72_0 2025-05-07T20:25:28.5981193Z cuda-cudart conda-forge/linux-64::cuda-cudart-12.6.77-h5888daf_0 2025-05-07T20:25:28.5981720Z cuda-cudart-dev conda-forge/linux-64::cuda-cudart-dev-12.6.77-h5888daf_0 2025-05-07T20:25:28.5982405Z cuda-cudart-dev_l~ 
conda-forge/noarch::cuda-cudart-dev_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:28.5983009Z cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.6.77-h5888daf_0 2025-05-07T20:25:28.5983890Z cuda-cudart-stati~ conda-forge/noarch::cuda-cudart-static_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:28.5984503Z cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:28.5985067Z cuda-cuobjdump conda-forge/linux-64::cuda-cuobjdump-12.6.77-hbd13f7d_1 2025-05-07T20:25:28.5985582Z cuda-cupti conda-forge/linux-64::cuda-cupti-12.6.80-hbd13f7d_0 2025-05-07T20:25:28.5986092Z cuda-cupti-dev conda-forge/linux-64::cuda-cupti-dev-12.6.80-h5888daf_0 2025-05-07T20:25:28.5986627Z cuda-cuxxfilt conda-forge/linux-64::cuda-cuxxfilt-12.6.77-hbd13f7d_1 2025-05-07T20:25:28.5987170Z cuda-driver-dev conda-forge/linux-64::cuda-driver-dev-12.6.77-h5888daf_0 2025-05-07T20:25:28.5987748Z cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:28.5988278Z cuda-gdb conda-forge/linux-64::cuda-gdb-12.6.77-h50b4baa_1 2025-05-07T20:25:28.5988779Z cuda-libraries conda-forge/linux-64::cuda-libraries-12.6.3-ha770c72_0 2025-05-07T20:25:28.5989400Z cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.6.3-ha770c72_0 2025-05-07T20:25:28.5989943Z cuda-nsight conda-forge/linux-64::cuda-nsight-12.6.77-h7938cbb_0 2025-05-07T20:25:28.5990425Z cuda-nvcc conda-forge/linux-64::cuda-nvcc-12.6.85-hcdd1206_0 2025-05-07T20:25:28.5990954Z cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.6.85-he91c749_0 2025-05-07T20:25:28.5991519Z cuda-nvcc-impl conda-forge/linux-64::cuda-nvcc-impl-12.6.85-h85509e4_0 2025-05-07T20:25:28.5992063Z cuda-nvcc-tools conda-forge/linux-64::cuda-nvcc-tools-12.6.85-he02047a_0 2025-05-07T20:25:28.5992628Z cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.6.85-h04802cd_0 2025-05-07T20:25:28.5993181Z cuda-nvdisasm conda-forge/linux-64::cuda-nvdisasm-12.6.77-hbd13f7d_1 2025-05-07T20:25:28.5993707Z cuda-nvml-dev conda-forge/linux-64::cuda-nvml-dev-12.6.77-hbd13f7d_1 2025-05-07T20:25:28.5994217Z cuda-nvprof conda-forge/linux-64::cuda-nvprof-12.6.80-hbd13f7d_0 2025-05-07T20:25:28.5994726Z cuda-nvprune conda-forge/linux-64::cuda-nvprune-12.6.77-hbd13f7d_1 2025-05-07T20:25:28.5995350Z cuda-nvrtc conda-forge/linux-64::cuda-nvrtc-12.6.85-hbd13f7d_0 2025-05-07T20:25:28.5995904Z cuda-nvrtc-dev conda-forge/linux-64::cuda-nvrtc-dev-12.6.85-h5888daf_0 2025-05-07T20:25:28.5996401Z cuda-nvtx conda-forge/linux-64::cuda-nvtx-12.6.77-hbd13f7d_0 2025-05-07T20:25:28.5996925Z cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.6.85-ha770c72_0 2025-05-07T20:25:28.5997490Z cuda-nvvm-impl conda-forge/linux-64::cuda-nvvm-impl-12.6.85-he02047a_0 2025-05-07T20:25:28.5998041Z cuda-nvvm-tools conda-forge/linux-64::cuda-nvvm-tools-12.6.85-he02047a_0 2025-05-07T20:25:28.5998552Z cuda-nvvp conda-forge/linux-64::cuda-nvvp-12.6.80-hbd13f7d_1 2025-05-07T20:25:28.5999036Z cuda-opencl conda-forge/linux-64::cuda-opencl-12.6.77-hbd13f7d_0 2025-05-07T20:25:28.5999565Z cuda-opencl-dev conda-forge/linux-64::cuda-opencl-dev-12.6.77-h5888daf_0 2025-05-07T20:25:28.6000133Z cuda-profiler-api conda-forge/linux-64::cuda-profiler-api-12.6.77-h7938cbb_0 2025-05-07T20:25:28.6000680Z cuda-runtime conda-forge/noarch::cuda-runtime-12.6.3-ha804496_0 2025-05-07T20:25:28.6001233Z cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.6.77-hbd13f7d_1 2025-05-07T20:25:28.6001789Z cuda-toolkit 
conda-forge/noarch::cuda-toolkit-12.6.3-ha804496_0 2025-05-07T20:25:28.6002268Z cuda-tools conda-forge/linux-64::cuda-tools-12.6.3-ha770c72_0 2025-05-07T20:25:28.6002747Z cuda-version conda-forge/noarch::cuda-version-12.6-h7480c83_3 2025-05-07T20:25:28.6003399Z cuda-visual-tools conda-forge/linux-64::cuda-visual-tools-12.6.3-ha770c72_0 2025-05-07T20:25:28.6003952Z cxx-compiler conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0 2025-05-07T20:25:28.6004482Z dbus conda-forge/linux-64::dbus-1.13.6-h5008d03_3 2025-05-07T20:25:28.6004893Z expat conda-forge/linux-64::expat-2.7.0-h5888daf_0 2025-05-07T20:25:28.6005415Z font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0 2025-05-07T20:25:28.6006028Z font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0 2025-05-07T20:25:28.6006626Z font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0 2025-05-07T20:25:28.6007205Z font-ttf-ubuntu conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3 2025-05-07T20:25:28.6007713Z fontconfig conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1 2025-05-07T20:25:28.6008214Z fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0 2025-05-07T20:25:28.6008713Z fonts-conda-forge conda-forge/noarch::fonts-conda-forge-1-0 2025-05-07T20:25:28.6009186Z freetype conda-forge/linux-64::freetype-2.13.3-ha770c72_1 2025-05-07T20:25:28.6009622Z gcc conda-forge/linux-64::gcc-11.4.0-h602e360_13 2025-05-07T20:25:28.6010044Z gds-tools conda-forge/linux-64::gds-tools-1.11.1.6-h5888daf_4 2025-05-07T20:25:28.6010472Z gmp conda-forge/linux-64::gmp-6.3.0-hac33072_2 2025-05-07T20:25:28.6010854Z gxx conda-forge/linux-64::gxx-11.4.0-h602e360_13 2025-05-07T20:25:28.6011269Z keyutils conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0 2025-05-07T20:25:28.6011685Z krb5 conda-forge/linux-64::krb5-1.21.3-h659f571_0 2025-05-07T20:25:28.6012090Z libcap conda-forge/linux-64::libcap-2.71-h39aace5_0 2025-05-07T20:25:28.6012536Z libcublas conda-forge/linux-64::libcublas-12.6.4.1-h5888daf_1 2025-05-07T20:25:28.6013049Z libcublas-dev conda-forge/linux-64::libcublas-dev-12.6.4.1-h5888daf_1 2025-05-07T20:25:28.6013545Z libcufft conda-forge/linux-64::libcufft-11.3.0.4-hbd13f7d_0 2025-05-07T20:25:28.6014030Z libcufft-dev conda-forge/linux-64::libcufft-dev-11.3.0.4-h5888daf_0 2025-05-07T20:25:28.6014528Z libcufile conda-forge/linux-64::libcufile-1.11.1.6-h12f29b5_4 2025-05-07T20:25:28.6015033Z libcufile-dev conda-forge/linux-64::libcufile-dev-1.11.1.6-h5888daf_4 2025-05-07T20:25:28.6015532Z libcurand conda-forge/linux-64::libcurand-10.3.7.77-hbd13f7d_0 2025-05-07T20:25:28.6016043Z libcurand-dev conda-forge/linux-64::libcurand-dev-10.3.7.77-h5888daf_0 2025-05-07T20:25:28.6016567Z libcusolver conda-forge/linux-64::libcusolver-11.7.1.2-h5888daf_1 2025-05-07T20:25:28.6017097Z libcusolver-dev conda-forge/linux-64::libcusolver-dev-11.7.1.2-h5888daf_1 2025-05-07T20:25:28.6017634Z libcusparse conda-forge/linux-64::libcusparse-12.5.4.2-hbd13f7d_0 2025-05-07T20:25:28.6018258Z libcusparse-dev conda-forge/linux-64::libcusparse-dev-12.5.4.2-h5888daf_0 2025-05-07T20:25:28.6018778Z libedit conda-forge/linux-64::libedit-3.1.20191231-he28a2e2_2 2025-05-07T20:25:28.6019242Z libexpat conda-forge/linux-64::libexpat-2.7.0-h5888daf_0 2025-05-07T20:25:28.6019714Z libfreetype conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1 2025-05-07T20:25:28.6020219Z libfreetype6 conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1 2025-05-07T20:25:28.6020734Z libgcrypt-lib 
conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2 2025-05-07T20:25:28.6021211Z libglib conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0 2025-05-07T20:25:28.6021675Z libgpg-error conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0 2025-05-07T20:25:28.6022150Z libiconv conda-forge/linux-64::libiconv-1.18-h4ce23a2_1 2025-05-07T20:25:28.6022584Z libnl conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0 2025-05-07T20:25:28.6023123Z libnpp conda-forge/linux-64::libnpp-12.3.1.54-h5888daf_0 2025-05-07T20:25:28.6023592Z libnpp-dev conda-forge/linux-64::libnpp-dev-12.3.1.54-h5888daf_0 2025-05-07T20:25:28.6024049Z libnsl conda-forge/linux-64::libnsl-2.0.1-hd590300_0 2025-05-07T20:25:28.6024555Z libnuma conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2 2025-05-07T20:25:28.6025021Z libnvfatbin conda-forge/linux-64::libnvfatbin-12.6.77-hbd13f7d_0 2025-05-07T20:25:28.6025555Z libnvfatbin-dev conda-forge/linux-64::libnvfatbin-dev-12.6.77-h5888daf_0 2025-05-07T20:25:28.6026095Z libnvjitlink conda-forge/linux-64::libnvjitlink-12.6.85-hbd13f7d_0 2025-05-07T20:25:28.6026637Z libnvjitlink-dev conda-forge/linux-64::libnvjitlink-dev-12.6.85-h5888daf_0 2025-05-07T20:25:28.6027168Z libnvjpeg conda-forge/linux-64::libnvjpeg-12.3.3.54-h5888daf_0 2025-05-07T20:25:28.6027679Z libnvjpeg-dev conda-forge/linux-64::libnvjpeg-dev-12.3.3.54-ha770c72_0 2025-05-07T20:25:28.6028171Z libpng conda-forge/linux-64::libpng-1.6.47-h943b412_0 2025-05-07T20:25:28.6028613Z libsqlite conda-forge/linux-64::libsqlite-3.49.2-hee588c1_0 2025-05-07T20:25:28.6029098Z libsystemd0 conda-forge/linux-64::libsystemd0-256.9-h2774228_0 2025-05-07T20:25:28.6029565Z libudev1 conda-forge/linux-64::libudev1-257.4-h9a4d06a_0 2025-05-07T20:25:28.6030002Z libxcb conda-forge/linux-64::libxcb-1.17.0-h8a09558_0 2025-05-07T20:25:28.6039256Z libxkbcommon conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0 2025-05-07T20:25:28.6039789Z libxkbfile conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1 2025-05-07T20:25:28.6040254Z libxml2 conda-forge/linux-64::libxml2-2.13.5-h064dc61_0 2025-05-07T20:25:28.6040694Z libzlib conda-forge/linux-64::libzlib-1.3.1-hb9d3cd8_2 2025-05-07T20:25:28.6041115Z lz4-c conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0 2025-05-07T20:25:28.6041609Z nsight-compute conda-forge/linux-64::nsight-compute-2024.3.2.3-hb5ebaad_0 2025-05-07T20:25:28.6042098Z nspr conda-forge/linux-64::nspr-4.36-h5888daf_0 2025-05-07T20:25:28.6042481Z nss conda-forge/linux-64::nss-3.111-h159eef7_0 2025-05-07T20:25:28.6042887Z ocl-icd conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0 2025-05-07T20:25:28.6043377Z opencl-headers conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0 2025-05-07T20:25:28.6043870Z pcre2 conda-forge/linux-64::pcre2-10.44-hc749103_2 2025-05-07T20:25:28.6044341Z pthread-stubs conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002 2025-05-07T20:25:28.6044833Z rdma-core conda-forge/linux-64::rdma-core-55.0-h5888daf_0 2025-05-07T20:25:28.6045269Z wayland conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0 2025-05-07T20:25:28.6045704Z xcb-util conda-forge/linux-64::xcb-util-0.4.1-hb711507_2 2025-05-07T20:25:28.6046200Z xcb-util-cursor conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0 2025-05-07T20:25:28.6046726Z xcb-util-image conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2 2025-05-07T20:25:28.6047265Z xcb-util-keysyms conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0 2025-05-07T20:25:28.6047851Z xcb-util-renderut~ conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0 2025-05-07T20:25:28.6048390Z xcb-util-wm 
conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0 2025-05-07T20:25:28.6048901Z xkeyboard-config conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0 2025-05-07T20:25:28.6049429Z xorg-libice conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0 2025-05-07T20:25:28.6049908Z xorg-libsm conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0 2025-05-07T20:25:28.6050386Z xorg-libx11 conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0 2025-05-07T20:25:28.6050862Z xorg-libxau conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0 2025-05-07T20:25:28.6051564Z xorg-libxcomposite conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2 2025-05-07T20:25:28.6052153Z xorg-libxdamage conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0 2025-05-07T20:25:28.6052689Z xorg-libxdmcp conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0 2025-05-07T20:25:28.6053269Z xorg-libxext conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0 2025-05-07T20:25:28.6053962Z xorg-libxfixes conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0 2025-05-07T20:25:28.6054526Z xorg-libxi conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0 2025-05-07T20:25:28.6055079Z xorg-libxrandr conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0 2025-05-07T20:25:28.6055892Z xorg-libxrender conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0 2025-05-07T20:25:28.6056529Z xorg-libxtst conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3 2025-05-07T20:25:28.6057169Z zstd conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2 2025-05-07T20:25:28.6057431Z 2025-05-07T20:25:28.6057564Z The following packages will be UPDATED: 2025-05-07T20:25:28.6057772Z 2025-05-07T20:25:28.6058126Z libuuid pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0 2025-05-07T20:25:28.6058739Z zlib pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2 2025-05-07T20:25:28.6059114Z 2025-05-07T20:25:28.6059344Z The following packages will be SUPERSEDED by a higher-priority channel: 2025-05-07T20:25:28.6059659Z 2025-05-07T20:25:28.6059956Z python pkgs/main::python-3.10.16-he870216_1 --> conda-forge::python-3.10.13-hd12c33a_1_cpython 2025-05-07T20:25:28.6060586Z sqlite pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1 2025-05-07T20:25:28.6061160Z tk pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101 2025-05-07T20:25:28.6061480Z 2025-05-07T20:25:28.6061504Z 2025-05-07T20:25:28.6061508Z 2025-05-07T20:25:28.6061654Z Downloading and Extracting Packages: ...working... 2025-05-07T20:25:28.6062037Z nsight-compute-2024. 
| 443.1 MB | | 0%
... (interleaved conda download progress bars elided: nsight-compute, libcublas, libcufft, libcusparse, cuda-nsight, cuda-nvvp, libcusolver, libnpp, and the remaining packages download in parallel) ...
| 443.1 MB | ## | 21% 2025-05-07T20:25:31.3507560Z 2025-05-07T20:25:31.3507565Z 2025-05-07T20:25:31.3584094Z libcufft-11.3.0.4 | 156.2 MB | ######3 | 64%  2025-05-07T20:25:31.3584455Z 2025-05-07T20:25:31.3584460Z 2025-05-07T20:25:31.3584466Z 2025-05-07T20:25:31.3584748Z 2025-05-07T20:25:31.3639079Z cuda-nsight-12.6.77 | 113.2 MB | ######9 | 69%  2025-05-07T20:25:31.3639467Z 2025-05-07T20:25:31.3639473Z 2025-05-07T20:25:31.3641116Z 2025-05-07T20:25:31.4146358Z libcusparse-12.5.4.2 | 118.6 MB | #######1 | 72%  2025-05-07T20:25:31.4147358Z 2025-05-07T20:25:31.4253229Z libcublas-12.6.4.1 | 256.2 MB | ###9 | 39%  2025-05-07T20:25:31.4584728Z nsight-compute-2024. | 443.1 MB | ##1 | 22% 2025-05-07T20:25:31.4585087Z 2025-05-07T20:25:31.4585093Z 2025-05-07T20:25:31.4585098Z 2025-05-07T20:25:31.4587098Z 2025-05-07T20:25:31.4625554Z cuda-nsight-12.6.77 | 113.2 MB | #######2 | 72%  2025-05-07T20:25:31.4625956Z 2025-05-07T20:25:31.4626444Z 2025-05-07T20:25:31.4640907Z libcufft-11.3.0.4 | 156.2 MB | ######5 | 66%  2025-05-07T20:25:31.4641277Z 2025-05-07T20:25:31.4641282Z 2025-05-07T20:25:31.4643705Z 2025-05-07T20:25:31.5192930Z libcusparse-12.5.4.2 | 118.6 MB | #######5 | 75%  2025-05-07T20:25:31.5193942Z 2025-05-07T20:25:31.5298611Z libcublas-12.6.4.1 | 256.2 MB | #### | 40%  2025-05-07T20:25:31.5626306Z nsight-compute-2024. | 443.1 MB | ##2 | 22% 2025-05-07T20:25:31.5626695Z 2025-05-07T20:25:31.5626702Z 2025-05-07T20:25:31.5684606Z libcufft-11.3.0.4 | 156.2 MB | ######8 | 68%  2025-05-07T20:25:31.5684969Z 2025-05-07T20:25:31.5684975Z 2025-05-07T20:25:31.5684980Z 2025-05-07T20:25:31.5687687Z 2025-05-07T20:25:31.5857968Z cuda-nsight-12.6.77 | 113.2 MB | #######5 | 75%  2025-05-07T20:25:31.5858426Z 2025-05-07T20:25:31.5858430Z 2025-05-07T20:25:31.5858434Z 2025-05-07T20:25:31.6224289Z libcusparse-12.5.4.2 | 118.6 MB | #######7 | 78%  2025-05-07T20:25:31.6227751Z 2025-05-07T20:25:31.6298660Z libcublas-12.6.4.1 | 256.2 MB | ####1 | 42%  2025-05-07T20:25:31.6626688Z nsight-compute-2024. | 443.1 MB | ##3 | 23% 2025-05-07T20:25:31.6627059Z 2025-05-07T20:25:31.6629160Z 2025-05-07T20:25:31.6745860Z libcufft-11.3.0.4 | 156.2 MB | ####### | 70%  2025-05-07T20:25:31.6746174Z 2025-05-07T20:25:31.6746177Z 2025-05-07T20:25:31.6746194Z 2025-05-07T20:25:31.6749117Z 2025-05-07T20:25:31.6861728Z cuda-nsight-12.6.77 | 113.2 MB | #######7 | 78%  2025-05-07T20:25:31.6862382Z 2025-05-07T20:25:31.6862386Z 2025-05-07T20:25:31.6863011Z 2025-05-07T20:25:31.7227914Z libcusparse-12.5.4.2 | 118.6 MB | ######## | 81%  2025-05-07T20:25:31.7229984Z 2025-05-07T20:25:31.7427155Z libcublas-12.6.4.1 | 256.2 MB | ####3 | 43%  2025-05-07T20:25:31.7630145Z nsight-compute-2024. | 443.1 MB | ##4 | 24% 2025-05-07T20:25:31.7630532Z 2025-05-07T20:25:31.7630539Z 2025-05-07T20:25:31.7748914Z libcufft-11.3.0.4 | 156.2 MB | #######2 | 73%  2025-05-07T20:25:31.7749228Z 2025-05-07T20:25:31.7749232Z 2025-05-07T20:25:31.7749236Z 2025-05-07T20:25:31.7749950Z 2025-05-07T20:25:31.7894803Z cuda-nsight-12.6.77 | 113.2 MB | ########1 | 81%  2025-05-07T20:25:31.7895110Z 2025-05-07T20:25:31.7895114Z 2025-05-07T20:25:31.7899025Z 2025-05-07T20:25:31.8232440Z libcusparse-12.5.4.2 | 118.6 MB | ########3 | 83%  2025-05-07T20:25:31.8233339Z 2025-05-07T20:25:31.8428457Z libcublas-12.6.4.1 | 256.2 MB | ####4 | 45%  2025-05-07T20:25:31.8632435Z nsight-compute-2024. 
| 443.1 MB | ##4 | 25% 2025-05-07T20:25:31.8632703Z 2025-05-07T20:25:31.8632707Z 2025-05-07T20:25:31.8752740Z libcufft-11.3.0.4 | 156.2 MB | #######5 | 75%  2025-05-07T20:25:31.8753004Z 2025-05-07T20:25:31.8753008Z 2025-05-07T20:25:31.8753012Z 2025-05-07T20:25:31.8755354Z 2025-05-07T20:25:31.9232663Z cuda-nsight-12.6.77 | 113.2 MB | ########4 | 84%  2025-05-07T20:25:31.9233013Z 2025-05-07T20:25:31.9433302Z libcublas-12.6.4.1 | 256.2 MB | ####6 | 46%  2025-05-07T20:25:31.9633561Z nsight-compute-2024. | 443.1 MB | ##5 | 26% 2025-05-07T20:25:31.9633925Z 2025-05-07T20:25:31.9636305Z 2025-05-07T20:25:31.9755449Z libcufft-11.3.0.4 | 156.2 MB | #######8 | 78%  2025-05-07T20:25:31.9755963Z 2025-05-07T20:25:31.9755967Z 2025-05-07T20:25:31.9755970Z 2025-05-07T20:25:31.9756193Z 2025-05-07T20:25:31.9937784Z cuda-nsight-12.6.77 | 113.2 MB | ########8 | 88%  2025-05-07T20:25:31.9938167Z 2025-05-07T20:25:31.9938171Z 2025-05-07T20:25:31.9939596Z 2025-05-07T20:25:32.0233560Z libcusparse-12.5.4.2 | 118.6 MB | ########6 | 86%  2025-05-07T20:25:32.0235634Z 2025-05-07T20:25:32.0437142Z libcublas-12.6.4.1 | 256.2 MB | ####7 | 48%  2025-05-07T20:25:32.0778463Z nsight-compute-2024. | 443.1 MB | ##6 | 27% 2025-05-07T20:25:32.0778768Z 2025-05-07T20:25:32.0778772Z 2025-05-07T20:25:32.0939323Z libcufft-11.3.0.4 | 156.2 MB | ######## | 81%  2025-05-07T20:25:32.0939624Z 2025-05-07T20:25:32.0939630Z 2025-05-07T20:25:32.0940325Z 2025-05-07T20:25:32.1053191Z libcusparse-12.5.4.2 | 118.6 MB | ########8 | 88%  2025-05-07T20:25:32.1053501Z 2025-05-07T20:25:32.1053507Z 2025-05-07T20:25:32.1053513Z 2025-05-07T20:25:32.1054167Z 2025-05-07T20:25:32.1255901Z cuda-nsight-12.6.77 | 113.2 MB | #########1 | 91%  2025-05-07T20:25:32.1259269Z 2025-05-07T20:25:32.1443833Z libcublas-12.6.4.1 | 256.2 MB | ####9 | 49%  2025-05-07T20:25:32.1867162Z nsight-compute-2024. | 443.1 MB | ##7 | 27% 2025-05-07T20:25:32.1867433Z 2025-05-07T20:25:32.1870549Z 2025-05-07T20:25:32.1995334Z libcufft-11.3.0.4 | 156.2 MB | ########3 | 83%  2025-05-07T20:25:32.1995604Z 2025-05-07T20:25:32.1995608Z 2025-05-07T20:25:32.1995611Z 2025-05-07T20:25:32.2055250Z libcusparse-12.5.4.2 | 118.6 MB | ######### | 91%  2025-05-07T20:25:32.2055844Z 2025-05-07T20:25:32.2055848Z 2025-05-07T20:25:32.2055852Z 2025-05-07T20:25:32.2055855Z 2025-05-07T20:25:32.2357350Z cuda-nsight-12.6.77 | 113.2 MB | #########4 | 95%  2025-05-07T20:25:32.2357710Z 2025-05-07T20:25:32.2530609Z libcublas-12.6.4.1 | 256.2 MB | ##### | 51%  2025-05-07T20:25:32.2998126Z nsight-compute-2024. | 443.1 MB | ##8 | 28% 2025-05-07T20:25:32.2998532Z 2025-05-07T20:25:32.2998538Z 2025-05-07T20:25:32.3002221Z 2025-05-07T20:25:32.3050019Z libcusparse-12.5.4.2 | 118.6 MB | #########3 | 94%  2025-05-07T20:25:32.3050306Z 2025-05-07T20:25:32.3050972Z 2025-05-07T20:25:32.3058335Z libcufft-11.3.0.4 | 156.2 MB | ########5 | 86%  2025-05-07T20:25:32.3058609Z 2025-05-07T20:25:32.3058613Z 2025-05-07T20:25:32.3058617Z 2025-05-07T20:25:32.3058621Z 2025-05-07T20:25:32.3384868Z cuda-nsight-12.6.77 | 113.2 MB | #########7 | 98%  2025-05-07T20:25:32.3385170Z 2025-05-07T20:25:32.3532082Z libcublas-12.6.4.1 | 256.2 MB | #####2 | 52%  2025-05-07T20:25:32.3998560Z nsight-compute-2024. 
| 443.1 MB | ##9 | 29% 2025-05-07T20:25:32.3998851Z 2025-05-07T20:25:32.3998855Z 2025-05-07T20:25:32.3999407Z 2025-05-07T20:25:32.4118597Z libcusparse-12.5.4.2 | 118.6 MB | #########6 | 96%  2025-05-07T20:25:32.4118951Z 2025-05-07T20:25:32.4118957Z 2025-05-07T20:25:32.4463994Z libcufft-11.3.0.4 | 156.2 MB | ########8 | 88%  2025-05-07T20:25:32.4464278Z 2025-05-07T20:25:32.4535193Z libcublas-12.6.4.1 | 256.2 MB | #####3 | 54%  2025-05-07T20:25:32.5039385Z nsight-compute-2024. | 443.1 MB | ##9 | 30% 2025-05-07T20:25:32.5039794Z 2025-05-07T20:25:32.5039799Z 2025-05-07T20:25:32.5041446Z 2025-05-07T20:25:32.5119093Z libcusparse-12.5.4.2 | 118.6 MB | #########8 | 99%  2025-05-07T20:25:32.5119388Z 2025-05-07T20:25:32.5123533Z 2025-05-07T20:25:32.5537920Z libcufft-11.3.0.4 | 156.2 MB | ######### | 90%  2025-05-07T20:25:32.5538811Z 2025-05-07T20:25:32.5545239Z libcublas-12.6.4.1 | 256.2 MB | #####5 | 55%  2025-05-07T20:25:32.6120756Z nsight-compute-2024. | 443.1 MB | ### | 31% 2025-05-07T20:25:32.6121023Z 2025-05-07T20:25:32.6121026Z 2025-05-07T20:25:32.6538474Z libcufft-11.3.0.4 | 156.2 MB | #########2 | 93%  2025-05-07T20:25:32.6539458Z 2025-05-07T20:25:32.6544234Z libcublas-12.6.4.1 | 256.2 MB | #####6 | 57%  2025-05-07T20:25:32.7121787Z nsight-compute-2024. | 443.1 MB | ###1 | 32% 2025-05-07T20:25:32.7122246Z 2025-05-07T20:25:32.7122578Z 2025-05-07T20:25:32.7544109Z libcufft-11.3.0.4 | 156.2 MB | #########5 | 95%  2025-05-07T20:25:32.7547116Z 2025-05-07T20:25:32.7550883Z libcublas-12.6.4.1 | 256.2 MB | #####8 | 58%  2025-05-07T20:25:32.8125109Z nsight-compute-2024. | 443.1 MB | ###2 | 33% 2025-05-07T20:25:32.8125378Z 2025-05-07T20:25:32.8125649Z 2025-05-07T20:25:32.8547598Z libcufft-11.3.0.4 | 156.2 MB | #########7 | 97%  2025-05-07T20:25:32.8547870Z 2025-05-07T20:25:32.8552035Z libcublas-12.6.4.1 | 256.2 MB | #####9 | 60%  2025-05-07T20:25:32.9184425Z nsight-compute-2024. | 443.1 MB | ###3 | 34% 2025-05-07T20:25:32.9184681Z 2025-05-07T20:25:32.9184684Z 2025-05-07T20:25:32.9549873Z libcufft-11.3.0.4 | 156.2 MB | #########9 | 100%  2025-05-07T20:25:32.9551904Z 2025-05-07T20:25:32.9556427Z libcublas-12.6.4.1 | 256.2 MB | ######1 | 61%  2025-05-07T20:25:33.0550880Z nsight-compute-2024. | 443.1 MB | ###4 | 35% 2025-05-07T20:25:33.0551817Z 2025-05-07T20:25:33.0558592Z libcublas-12.6.4.1 | 256.2 MB | ######2 | 63%  2025-05-07T20:25:33.1551326Z nsight-compute-2024. | 443.1 MB | ###5 | 36% 2025-05-07T20:25:33.1552291Z 2025-05-07T20:25:33.1593646Z libcublas-12.6.4.1 | 256.2 MB | ######4 | 65%  2025-05-07T20:25:33.2551136Z nsight-compute-2024. | 443.1 MB | ###7 | 37% 2025-05-07T20:25:33.2552735Z 2025-05-07T20:25:33.2870569Z libcublas-12.6.4.1 | 256.2 MB | ######6 | 67%  2025-05-07T20:25:33.3554746Z nsight-compute-2024. | 443.1 MB | ###8 | 38% 2025-05-07T20:25:33.3556086Z 2025-05-07T20:25:33.3871332Z libcublas-12.6.4.1 | 256.2 MB | ######8 | 69%  2025-05-07T20:25:33.4560908Z nsight-compute-2024. | 443.1 MB | ###9 | 39% 2025-05-07T20:25:33.4562172Z 2025-05-07T20:25:33.4872839Z libcublas-12.6.4.1 | 256.2 MB | ####### | 70%  2025-05-07T20:25:33.5629862Z nsight-compute-2024. | 443.1 MB | #### | 40% 2025-05-07T20:25:33.5630258Z 2025-05-07T20:25:33.5875550Z libcublas-12.6.4.1 | 256.2 MB | #######2 | 72%  2025-05-07T20:25:33.6630558Z nsight-compute-2024. | 443.1 MB | ####1 | 42% 2025-05-07T20:25:33.6630927Z 2025-05-07T20:25:33.6875085Z libcublas-12.6.4.1 | 256.2 MB | #######4 | 74%  2025-05-07T20:25:33.7683441Z nsight-compute-2024. 
| 443.1 MB | ####2 | 43% 2025-05-07T20:25:33.7686338Z 2025-05-07T20:25:33.7875624Z libcublas-12.6.4.1 | 256.2 MB | #######5 | 76%  2025-05-07T20:25:33.8722861Z nsight-compute-2024. | 443.1 MB | ####3 | 44% 2025-05-07T20:25:33.8723310Z 2025-05-07T20:25:33.8879307Z libcublas-12.6.4.1 | 256.2 MB | #######7 | 78%  2025-05-07T20:25:33.9724839Z nsight-compute-2024. | 443.1 MB | ####5 | 45% 2025-05-07T20:25:33.9725196Z 2025-05-07T20:25:33.9894328Z libcublas-12.6.4.1 | 256.2 MB | #######9 | 79%  2025-05-07T20:25:34.0777302Z nsight-compute-2024. | 443.1 MB | ####6 | 46% 2025-05-07T20:25:34.0777596Z 2025-05-07T20:25:34.0903759Z libcublas-12.6.4.1 | 256.2 MB | ########1 | 81%  2025-05-07T20:25:34.1778154Z nsight-compute-2024. | 443.1 MB | ####7 | 47% 2025-05-07T20:25:34.1778492Z 2025-05-07T20:25:34.1904225Z libcublas-12.6.4.1 | 256.2 MB | ########2 | 83%  2025-05-07T20:25:34.2780022Z nsight-compute-2024. | 443.1 MB | ####8 | 49% 2025-05-07T20:25:34.2780370Z 2025-05-07T20:25:34.2942290Z libcublas-12.6.4.1 | 256.2 MB | ########4 | 85%  2025-05-07T20:25:34.3782427Z nsight-compute-2024. | 443.1 MB | ####9 | 50% 2025-05-07T20:25:34.3782693Z 2025-05-07T20:25:34.3996610Z libcublas-12.6.4.1 | 256.2 MB | ########6 | 87%  2025-05-07T20:25:34.4784613Z nsight-compute-2024. | 443.1 MB | ##### | 51% 2025-05-07T20:25:34.4785451Z 2025-05-07T20:25:34.4998835Z libcublas-12.6.4.1 | 256.2 MB | ########8 | 88%  2025-05-07T20:25:34.5660251Z nsight-compute-2024. | 443.1 MB | #####2 | 52% 2025-05-07T20:25:34.5660654Z 2025-05-07T20:25:34.5660929Z 2025-05-07T20:25:34.5660935Z 2025-05-07T20:25:34.5665874Z 2025-05-07T20:25:34.5800239Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:34.5800564Z 2025-05-07T20:25:34.6089607Z libcublas-12.6.4.1 | 256.2 MB | ######### | 90%  2025-05-07T20:25:34.6284334Z nsight-compute-2024. | 443.1 MB | #####3 | 53% 2025-05-07T20:25:34.6284610Z 2025-05-07T20:25:34.6284614Z 2025-05-07T20:25:34.6284618Z 2025-05-07T20:25:34.6284622Z 2025-05-07T20:25:34.6286007Z 2025-05-07T20:25:34.6949733Z cuda-nvvp-12.6.80 | 109.3 MB | | 0%  2025-05-07T20:25:34.6952170Z 2025-05-07T20:25:34.7284774Z libcublas-12.6.4.1 | 256.2 MB | #########1 | 92%  2025-05-07T20:25:34.7285054Z 2025-05-07T20:25:34.7285058Z 2025-05-07T20:25:34.7285061Z 2025-05-07T20:25:34.7285082Z 2025-05-07T20:25:34.7285278Z 2025-05-07T20:25:34.7306909Z cuda-nvvp-12.6.80 | 109.3 MB | 2 | 2%  2025-05-07T20:25:34.8228983Z nsight-compute-2024. | 443.1 MB | #####4 | 54% 2025-05-07T20:25:34.8234352Z 2025-05-07T20:25:34.8287943Z libcublas-12.6.4.1 | 256.2 MB | #########3 | 94%  2025-05-07T20:25:34.8288209Z 2025-05-07T20:25:34.8288213Z 2025-05-07T20:25:34.8288217Z 2025-05-07T20:25:34.8288221Z 2025-05-07T20:25:34.8288888Z 2025-05-07T20:25:34.8614425Z cuda-nvvp-12.6.80 | 109.3 MB | 5 | 6%  2025-05-07T20:25:34.9288611Z nsight-compute-2024. | 443.1 MB | #####5 | 55% 2025-05-07T20:25:34.9288872Z 2025-05-07T20:25:34.9288882Z 2025-05-07T20:25:34.9288886Z 2025-05-07T20:25:34.9288889Z 2025-05-07T20:25:34.9291010Z 2025-05-07T20:25:34.9434057Z cuda-nvvp-12.6.80 | 109.3 MB | 8 | 9%  2025-05-07T20:25:34.9434372Z 2025-05-07T20:25:34.9892279Z libcublas-12.6.4.1 | 256.2 MB | #########5 | 95%  2025-05-07T20:25:35.0293290Z nsight-compute-2024. 
| 443.1 MB | #####6 | 56% 2025-05-07T20:25:35.0293666Z 2025-05-07T20:25:35.0293672Z 2025-05-07T20:25:35.0293689Z 2025-05-07T20:25:35.0293694Z 2025-05-07T20:25:35.0296567Z 2025-05-07T20:25:35.0497557Z cuda-nvvp-12.6.80 | 109.3 MB | #1 | 12%  2025-05-07T20:25:35.0498693Z 2025-05-07T20:25:35.0927530Z libcublas-12.6.4.1 | 256.2 MB | #########6 | 97%  2025-05-07T20:25:35.1294376Z nsight-compute-2024. | 443.1 MB | #####7 | 57% 2025-05-07T20:25:35.1294726Z 2025-05-07T20:25:35.1294730Z 2025-05-07T20:25:35.1294734Z 2025-05-07T20:25:35.1294738Z 2025-05-07T20:25:35.1297136Z 2025-05-07T20:25:35.1531500Z cuda-nvvp-12.6.80 | 109.3 MB | #5 | 15%  2025-05-07T20:25:35.1534038Z 2025-05-07T20:25:35.1957803Z libcublas-12.6.4.1 | 256.2 MB | #########8 | 98%  2025-05-07T20:25:35.2294653Z nsight-compute-2024. | 443.1 MB | #####8 | 58% 2025-05-07T20:25:35.2295021Z 2025-05-07T20:25:35.2295025Z 2025-05-07T20:25:35.2295029Z 2025-05-07T20:25:35.2295033Z 2025-05-07T20:25:35.2297890Z 2025-05-07T20:25:35.2532553Z cuda-nvvp-12.6.80 | 109.3 MB | #8 | 19%  2025-05-07T20:25:35.2533913Z 2025-05-07T20:25:35.2984006Z libcublas-12.6.4.1 | 256.2 MB | #########9 | 100%  2025-05-07T20:25:35.3295290Z nsight-compute-2024. | 443.1 MB | #####9 | 59% 2025-05-07T20:25:35.3295645Z 2025-05-07T20:25:35.3295649Z 2025-05-07T20:25:35.3295653Z 2025-05-07T20:25:35.3295656Z 2025-05-07T20:25:35.3296471Z 2025-05-07T20:25:35.3492615Z cuda-nvvp-12.6.80 | 109.3 MB | ##2 | 23%  2025-05-07T20:25:35.3492950Z 2025-05-07T20:25:35.3492954Z 2025-05-07T20:25:35.3494339Z 2025-05-07T20:25:35.3991784Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:25:35.3992117Z 2025-05-07T20:25:35.3992120Z 2025-05-07T20:25:35.3992124Z 2025-05-07T20:25:35.3992128Z 2025-05-07T20:25:35.3992132Z 2025-05-07T20:25:35.3993038Z 2025-05-07T20:25:35.4142861Z libcusolver-11.7.1.2 | 95.8 MB | | 0%  2025-05-07T20:25:35.4406455Z nsight-compute-2024. | 443.1 MB | #####9 | 60% 2025-05-07T20:25:35.4406996Z 2025-05-07T20:25:35.4407000Z 2025-05-07T20:25:35.4407003Z 2025-05-07T20:25:35.4407007Z 2025-05-07T20:25:35.4407019Z 2025-05-07T20:25:35.4991760Z cuda-nvvp-12.6.80 | 109.3 MB | ##6 | 26%  2025-05-07T20:25:35.4992061Z 2025-05-07T20:25:35.4992065Z 2025-05-07T20:25:35.4992068Z 2025-05-07T20:25:35.4992072Z 2025-05-07T20:25:35.4992085Z 2025-05-07T20:25:35.4992192Z 2025-05-07T20:25:35.5276896Z libcusolver-11.7.1.2 | 95.8 MB | 3 | 3%  2025-05-07T20:25:35.5541514Z nsight-compute-2024. | 443.1 MB | ###### | 61% 2025-05-07T20:25:35.5541824Z 2025-05-07T20:25:35.5541828Z 2025-05-07T20:25:35.5541832Z 2025-05-07T20:25:35.5541836Z 2025-05-07T20:25:35.5541839Z 2025-05-07T20:25:35.5995420Z cuda-nvvp-12.6.80 | 109.3 MB | ##9 | 29%  2025-05-07T20:25:35.5995718Z 2025-05-07T20:25:35.5995722Z 2025-05-07T20:25:35.5995725Z 2025-05-07T20:25:35.5995729Z 2025-05-07T20:25:35.5995732Z 2025-05-07T20:25:35.5995744Z 2025-05-07T20:25:35.6453828Z libcusolver-11.7.1.2 | 95.8 MB | 6 | 7%  2025-05-07T20:25:35.6568862Z nsight-compute-2024. 
| 443.1 MB | ######1 | 62% 2025-05-07T20:25:35.6569115Z 2025-05-07T20:25:35.6569119Z 2025-05-07T20:25:35.6569122Z 2025-05-07T20:25:35.6569126Z 2025-05-07T20:25:35.6570541Z 2025-05-07T20:25:35.6995681Z cuda-nvvp-12.6.80 | 109.3 MB | ###2 | 33%  2025-05-07T20:25:35.6995965Z 2025-05-07T20:25:35.6995969Z 2025-05-07T20:25:35.6995973Z 2025-05-07T20:25:35.6995976Z 2025-05-07T20:25:35.6995980Z 2025-05-07T20:25:35.6995983Z 2025-05-07T20:25:35.7614115Z libcusolver-11.7.1.2 | 95.8 MB | 9 | 10%  2025-05-07T20:25:35.7614414Z 2025-05-07T20:25:35.7614417Z 2025-05-07T20:25:35.7614421Z 2025-05-07T20:25:35.7614436Z 2025-05-07T20:25:35.7615906Z 2025-05-07T20:25:35.7622307Z cuda-nvvp-12.6.80 | 109.3 MB | ###5 | 36%  2025-05-07T20:25:35.7997672Z nsight-compute-2024. | 443.1 MB | ######2 | 62% 2025-05-07T20:25:35.8006003Z 2025-05-07T20:25:35.8006471Z 2025-05-07T20:25:35.8006478Z 2025-05-07T20:25:35.8006484Z 2025-05-07T20:25:35.8006490Z 2025-05-07T20:25:35.8006495Z 2025-05-07T20:25:35.8667371Z libcusolver-11.7.1.2 | 95.8 MB | #3 | 13%  2025-05-07T20:25:35.8667792Z 2025-05-07T20:25:35.8667797Z 2025-05-07T20:25:35.8667802Z 2025-05-07T20:25:35.8667807Z 2025-05-07T20:25:35.8669302Z 2025-05-07T20:25:35.8734176Z cuda-nvvp-12.6.80 | 109.3 MB | ###8 | 39%  2025-05-07T20:25:35.9024050Z nsight-compute-2024. | 443.1 MB | ######3 | 63% 2025-05-07T20:25:35.9024420Z 2025-05-07T20:25:35.9024425Z 2025-05-07T20:25:35.9024431Z 2025-05-07T20:25:35.9024436Z 2025-05-07T20:25:35.9024441Z 2025-05-07T20:25:35.9024447Z 2025-05-07T20:25:35.9672819Z libcusolver-11.7.1.2 | 95.8 MB | #6 | 16%  2025-05-07T20:25:35.9673262Z 2025-05-07T20:25:35.9673269Z 2025-05-07T20:25:35.9673274Z 2025-05-07T20:25:35.9673291Z 2025-05-07T20:25:35.9673297Z 2025-05-07T20:25:35.9834053Z cuda-nvvp-12.6.80 | 109.3 MB | ####1 | 42%  2025-05-07T20:25:36.0030922Z nsight-compute-2024. | 443.1 MB | ######3 | 64% 2025-05-07T20:25:36.0031289Z 2025-05-07T20:25:36.0031445Z 2025-05-07T20:25:36.0031452Z 2025-05-07T20:25:36.0031474Z 2025-05-07T20:25:36.0031508Z 2025-05-07T20:25:36.0034371Z 2025-05-07T20:25:36.0768321Z libcusolver-11.7.1.2 | 95.8 MB | #9 | 20%  2025-05-07T20:25:36.0768736Z 2025-05-07T20:25:36.0768742Z 2025-05-07T20:25:36.0768746Z 2025-05-07T20:25:36.0768752Z 2025-05-07T20:25:36.0772620Z 2025-05-07T20:25:36.0870995Z cuda-nvvp-12.6.80 | 109.3 MB | ####4 | 45%  2025-05-07T20:25:36.1037258Z nsight-compute-2024. | 443.1 MB | ######4 | 65% 2025-05-07T20:25:36.1037635Z 2025-05-07T20:25:36.1037640Z 2025-05-07T20:25:36.1037646Z 2025-05-07T20:25:36.1037651Z 2025-05-07T20:25:36.1037656Z 2025-05-07T20:25:36.1037834Z 2025-05-07T20:25:36.1822256Z libcusolver-11.7.1.2 | 95.8 MB | ##2 | 23%  2025-05-07T20:25:36.1822609Z 2025-05-07T20:25:36.1822613Z 2025-05-07T20:25:36.1822617Z 2025-05-07T20:25:36.1822620Z 2025-05-07T20:25:36.1824106Z 2025-05-07T20:25:36.1897377Z cuda-nvvp-12.6.80 | 109.3 MB | ####7 | 48%  2025-05-07T20:25:36.2050393Z nsight-compute-2024. | 443.1 MB | ######5 | 65% 2025-05-07T20:25:36.2050791Z 2025-05-07T20:25:36.2050798Z 2025-05-07T20:25:36.2050805Z 2025-05-07T20:25:36.2050812Z 2025-05-07T20:25:36.2050818Z 2025-05-07T20:25:36.2050824Z 2025-05-07T20:25:36.2825953Z libcusolver-11.7.1.2 | 95.8 MB | ##5 | 26%  2025-05-07T20:25:36.2826378Z 2025-05-07T20:25:36.2826384Z 2025-05-07T20:25:36.2826389Z 2025-05-07T20:25:36.2826409Z 2025-05-07T20:25:36.2828608Z 2025-05-07T20:25:36.2900917Z cuda-nvvp-12.6.80 | 109.3 MB | ##### | 51%  2025-05-07T20:25:36.3054700Z nsight-compute-2024. 
| 443.1 MB | ######5 | 66% 2025-05-07T20:25:36.3055081Z 2025-05-07T20:25:36.3055087Z 2025-05-07T20:25:36.3055092Z 2025-05-07T20:25:36.3055098Z 2025-05-07T20:25:36.3055103Z 2025-05-07T20:25:36.3055119Z 2025-05-07T20:25:36.3846140Z libcusolver-11.7.1.2 | 95.8 MB | ##9 | 29%  2025-05-07T20:25:36.3846496Z 2025-05-07T20:25:36.3846500Z 2025-05-07T20:25:36.3846504Z 2025-05-07T20:25:36.3846507Z 2025-05-07T20:25:36.3847919Z 2025-05-07T20:25:36.3902002Z cuda-nvvp-12.6.80 | 109.3 MB | #####3 | 54%  2025-05-07T20:25:36.4056978Z nsight-compute-2024. | 443.1 MB | ######6 | 67% 2025-05-07T20:25:36.4057289Z 2025-05-07T20:25:36.4057295Z 2025-05-07T20:25:36.4057304Z 2025-05-07T20:25:36.4057416Z 2025-05-07T20:25:36.4057423Z 2025-05-07T20:25:36.4057439Z 2025-05-07T20:25:36.4706367Z libcusolver-11.7.1.2 | 95.8 MB | ###2 | 32%  2025-05-07T20:25:36.4706678Z 2025-05-07T20:25:36.4706682Z 2025-05-07T20:25:36.4910918Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%  2025-05-07T20:25:36.4974499Z nsight-compute-2024. | 443.1 MB | ######7 | 67% 2025-05-07T20:25:36.4974773Z 2025-05-07T20:25:36.4974777Z 2025-05-07T20:25:36.4974780Z 2025-05-07T20:25:36.4974784Z 2025-05-07T20:25:36.4974788Z 2025-05-07T20:25:36.5191562Z cuda-nvvp-12.6.80 | 109.3 MB | #####6 | 57%  2025-05-07T20:25:36.5191961Z 2025-05-07T20:25:36.5191965Z 2025-05-07T20:25:36.5191969Z 2025-05-07T20:25:36.5191972Z 2025-05-07T20:25:36.5191976Z 2025-05-07T20:25:36.5191980Z 2025-05-07T20:25:36.5343905Z libcusolver-11.7.1.2 | 95.8 MB | ###5 | 36%  2025-05-07T20:25:36.5344217Z 2025-05-07T20:25:36.5344221Z 2025-05-07T20:25:36.5344225Z 2025-05-07T20:25:36.5344228Z 2025-05-07T20:25:36.5344232Z 2025-05-07T20:25:36.5344252Z 2025-05-07T20:25:36.5347714Z 2025-05-07T20:25:36.6064998Z libnpp-12.3.1.54 | 93.4 MB | | 0%  2025-05-07T20:25:36.6084216Z nsight-compute-2024. | 443.1 MB | ######8 | 68% 2025-05-07T20:25:36.6084517Z 2025-05-07T20:25:36.6084522Z 2025-05-07T20:25:36.6084525Z 2025-05-07T20:25:36.6084529Z 2025-05-07T20:25:36.6084533Z 2025-05-07T20:25:36.6345856Z cuda-nvvp-12.6.80 | 109.3 MB | #####9 | 60%  2025-05-07T20:25:36.6346156Z 2025-05-07T20:25:36.6346160Z 2025-05-07T20:25:36.6346164Z 2025-05-07T20:25:36.6346168Z 2025-05-07T20:25:36.6346171Z 2025-05-07T20:25:36.6346175Z 2025-05-07T20:25:36.6353131Z 2025-05-07T20:25:36.6356859Z libnpp-12.3.1.54 | 93.4 MB | 2 | 3%  2025-05-07T20:25:36.6357253Z 2025-05-07T20:25:36.6357257Z 2025-05-07T20:25:36.6357261Z 2025-05-07T20:25:36.6357264Z 2025-05-07T20:25:36.6357268Z 2025-05-07T20:25:36.6357271Z 2025-05-07T20:25:36.7111253Z libcusolver-11.7.1.2 | 95.8 MB | ###8 | 39%  2025-05-07T20:25:36.7111675Z 2025-05-07T20:25:36.7111680Z 2025-05-07T20:25:36.7111685Z 2025-05-07T20:25:36.7111691Z 2025-05-07T20:25:36.7114216Z 2025-05-07T20:25:36.7195864Z cuda-nvvp-12.6.80 | 109.3 MB | ######2 | 62%  2025-05-07T20:25:36.7350502Z nsight-compute-2024. | 443.1 MB | ######8 | 69% 2025-05-07T20:25:36.7350769Z 2025-05-07T20:25:36.7350773Z 2025-05-07T20:25:36.7350777Z 2025-05-07T20:25:36.7350780Z 2025-05-07T20:25:36.7350784Z 2025-05-07T20:25:36.7350788Z 2025-05-07T20:25:36.7352436Z 2025-05-07T20:25:36.7509352Z libnpp-12.3.1.54 | 93.4 MB | 5 | 6%  2025-05-07T20:25:36.7509756Z 2025-05-07T20:25:36.7509762Z 2025-05-07T20:25:36.7509767Z 2025-05-07T20:25:36.7509772Z 2025-05-07T20:25:36.7509777Z 2025-05-07T20:25:36.7512962Z 2025-05-07T20:25:36.8203930Z libcusolver-11.7.1.2 | 95.8 MB | ####1 | 42%  2025-05-07T20:25:36.8355911Z nsight-compute-2024. 
| 443.1 MB | ######9 | 70% 2025-05-07T20:25:36.8356179Z 2025-05-07T20:25:36.8356219Z 2025-05-07T20:25:36.8356224Z 2025-05-07T20:25:36.8356227Z 2025-05-07T20:25:36.8356231Z 2025-05-07T20:25:36.8356243Z 2025-05-07T20:25:36.8357848Z 2025-05-07T20:25:36.8402748Z libnpp-12.3.1.54 | 93.4 MB | 8 | 8%  2025-05-07T20:25:36.8403045Z 2025-05-07T20:25:36.8403049Z 2025-05-07T20:25:36.8403052Z 2025-05-07T20:25:36.8403056Z 2025-05-07T20:25:36.8403060Z 2025-05-07T20:25:36.8726062Z cuda-nvvp-12.6.80 | 109.3 MB | ######5 | 65%  2025-05-07T20:25:36.8726381Z 2025-05-07T20:25:36.8726385Z 2025-05-07T20:25:36.8726388Z 2025-05-07T20:25:36.8726392Z 2025-05-07T20:25:36.8726396Z 2025-05-07T20:25:36.8726399Z 2025-05-07T20:25:36.9276745Z libcusolver-11.7.1.2 | 95.8 MB | ####4 | 45%  2025-05-07T20:25:36.9360711Z nsight-compute-2024. | 443.1 MB | ####### | 70% 2025-05-07T20:25:36.9361035Z 2025-05-07T20:25:36.9361041Z 2025-05-07T20:25:36.9361068Z 2025-05-07T20:25:36.9361074Z 2025-05-07T20:25:36.9361079Z 2025-05-07T20:25:36.9361085Z 2025-05-07T20:25:36.9361091Z 2025-05-07T20:25:36.9527367Z libnpp-12.3.1.54 | 93.4 MB | #1 | 11%  2025-05-07T20:25:36.9528305Z 2025-05-07T20:25:36.9528310Z 2025-05-07T20:25:36.9528313Z 2025-05-07T20:25:36.9528317Z 2025-05-07T20:25:36.9530740Z 2025-05-07T20:25:36.9854347Z cuda-nvvp-12.6.80 | 109.3 MB | ######7 | 68%  2025-05-07T20:25:36.9854649Z 2025-05-07T20:25:36.9854653Z 2025-05-07T20:25:36.9854656Z 2025-05-07T20:25:36.9854660Z 2025-05-07T20:25:36.9854664Z 2025-05-07T20:25:36.9856721Z 2025-05-07T20:25:37.0356631Z libcusolver-11.7.1.2 | 95.8 MB | ####7 | 47%  2025-05-07T20:25:37.0364026Z nsight-compute-2024. | 443.1 MB | ####### | 71% 2025-05-07T20:25:37.0364324Z 2025-05-07T20:25:37.0364329Z 2025-05-07T20:25:37.0364332Z 2025-05-07T20:25:37.0364336Z 2025-05-07T20:25:37.0364339Z 2025-05-07T20:25:37.0364369Z 2025-05-07T20:25:37.0364373Z 2025-05-07T20:25:37.0528719Z libnpp-12.3.1.54 | 93.4 MB | #4 | 14%  2025-05-07T20:25:37.0529113Z 2025-05-07T20:25:37.0529132Z 2025-05-07T20:25:37.0529137Z 2025-05-07T20:25:37.0529142Z 2025-05-07T20:25:37.0529147Z 2025-05-07T20:25:37.0855847Z cuda-nvvp-12.6.80 | 109.3 MB | ####### | 70%  2025-05-07T20:25:37.0856291Z 2025-05-07T20:25:37.0856297Z 2025-05-07T20:25:37.0856302Z 2025-05-07T20:25:37.0856307Z 2025-05-07T20:25:37.0856312Z 2025-05-07T20:25:37.0856318Z 2025-05-07T20:25:37.1369514Z libcusolver-11.7.1.2 | 95.8 MB | ##### | 50%  2025-05-07T20:25:37.1369826Z 2025-05-07T20:25:37.1369830Z 2025-05-07T20:25:37.1369834Z 2025-05-07T20:25:37.1369838Z 2025-05-07T20:25:37.1369841Z 2025-05-07T20:25:37.1369845Z 2025-05-07T20:25:37.1373975Z 2025-05-07T20:25:37.1559685Z libnpp-12.3.1.54 | 93.4 MB | #7 | 17%  2025-05-07T20:25:37.1560102Z 2025-05-07T20:25:37.1560387Z 2025-05-07T20:25:37.1560407Z 2025-05-07T20:25:37.1560412Z 2025-05-07T20:25:37.1563491Z 2025-05-07T20:25:37.1576177Z cuda-nvvp-12.6.80 | 109.3 MB | #######2 | 73%  2025-05-07T20:25:37.1917057Z nsight-compute-2024. 
| 443.1 MB | #######1 | 72% 2025-05-07T20:25:37.1917364Z 2025-05-07T20:25:37.1917368Z 2025-05-07T20:25:37.1917372Z 2025-05-07T20:25:37.1917375Z 2025-05-07T20:25:37.1917379Z 2025-05-07T20:25:37.1920103Z 2025-05-07T20:25:37.2393976Z libcusolver-11.7.1.2 | 95.8 MB | #####2 | 53%  2025-05-07T20:25:37.2394288Z 2025-05-07T20:25:37.2394292Z 2025-05-07T20:25:37.2394296Z 2025-05-07T20:25:37.2394299Z 2025-05-07T20:25:37.2394303Z 2025-05-07T20:25:37.2394307Z 2025-05-07T20:25:37.2397890Z 2025-05-07T20:25:37.2633829Z libnpp-12.3.1.54 | 93.4 MB | #9 | 20%  2025-05-07T20:25:37.2634179Z 2025-05-07T20:25:37.2634183Z 2025-05-07T20:25:37.2634186Z 2025-05-07T20:25:37.2634190Z 2025-05-07T20:25:37.2635760Z 2025-05-07T20:25:37.2728901Z cuda-nvvp-12.6.80 | 109.3 MB | #######5 | 75%  2025-05-07T20:25:37.2941005Z nsight-compute-2024. | 443.1 MB | #######2 | 72% 2025-05-07T20:25:37.2941334Z 2025-05-07T20:25:37.2941340Z 2025-05-07T20:25:37.2941343Z 2025-05-07T20:25:37.2941434Z 2025-05-07T20:25:37.2941437Z 2025-05-07T20:25:37.2941464Z 2025-05-07T20:25:37.3472995Z libcusolver-11.7.1.2 | 95.8 MB | #####5 | 55%  2025-05-07T20:25:37.3473432Z 2025-05-07T20:25:37.3473437Z 2025-05-07T20:25:37.3473442Z 2025-05-07T20:25:37.3473447Z 2025-05-07T20:25:37.3473452Z 2025-05-07T20:25:37.3473458Z 2025-05-07T20:25:37.3475278Z 2025-05-07T20:25:37.3692711Z libnpp-12.3.1.54 | 93.4 MB | ##2 | 23%  2025-05-07T20:25:37.3693007Z 2025-05-07T20:25:37.3693011Z 2025-05-07T20:25:37.3693014Z 2025-05-07T20:25:37.3693018Z 2025-05-07T20:25:37.3699701Z 2025-05-07T20:25:37.3812834Z cuda-nvvp-12.6.80 | 109.3 MB | #######7 | 78%  2025-05-07T20:25:37.3948623Z nsight-compute-2024. | 443.1 MB | #######2 | 73% 2025-05-07T20:25:37.3948930Z 2025-05-07T20:25:37.3948934Z 2025-05-07T20:25:37.3948937Z 2025-05-07T20:25:37.3948953Z 2025-05-07T20:25:37.3948957Z 2025-05-07T20:25:37.3950773Z 2025-05-07T20:25:37.4473774Z libcusolver-11.7.1.2 | 95.8 MB | #####8 | 58%  2025-05-07T20:25:37.4474091Z 2025-05-07T20:25:37.4474095Z 2025-05-07T20:25:37.4474099Z 2025-05-07T20:25:37.4474103Z 2025-05-07T20:25:37.4474106Z 2025-05-07T20:25:37.4474110Z 2025-05-07T20:25:37.4474114Z 2025-05-07T20:25:37.4818241Z libnpp-12.3.1.54 | 93.4 MB | ##5 | 26%  2025-05-07T20:25:37.4832214Z nsight-compute-2024. | 443.1 MB | #######3 | 73% 2025-05-07T20:25:37.4832484Z 2025-05-07T20:25:37.4832490Z 2025-05-07T20:25:37.4832495Z 2025-05-07T20:25:37.4832500Z 2025-05-07T20:25:37.4836811Z 2025-05-07T20:25:37.4950744Z cuda-nvvp-12.6.80 | 109.3 MB | #######9 | 80%  2025-05-07T20:25:37.4951109Z 2025-05-07T20:25:37.4951116Z 2025-05-07T20:25:37.4951122Z 2025-05-07T20:25:37.4951128Z 2025-05-07T20:25:37.4951135Z 2025-05-07T20:25:37.4954063Z 2025-05-07T20:25:37.5511869Z libcusolver-11.7.1.2 | 95.8 MB | ###### | 61%  2025-05-07T20:25:37.5512183Z 2025-05-07T20:25:37.5512187Z 2025-05-07T20:25:37.5512191Z 2025-05-07T20:25:37.5512194Z 2025-05-07T20:25:37.5512198Z 2025-05-07T20:25:37.5512201Z 2025-05-07T20:25:37.5514259Z 2025-05-07T20:25:37.5871029Z libnpp-12.3.1.54 | 93.4 MB | ##8 | 29%  2025-05-07T20:25:37.5885158Z nsight-compute-2024. 
| 443.1 MB | #######3 | 74% 2025-05-07T20:25:37.5885526Z 2025-05-07T20:25:37.5885532Z 2025-05-07T20:25:37.5885537Z 2025-05-07T20:25:37.5885542Z 2025-05-07T20:25:37.5887810Z 2025-05-07T20:25:37.5986621Z cuda-nvvp-12.6.80 | 109.3 MB | ########2 | 82%  2025-05-07T20:25:37.5986931Z 2025-05-07T20:25:37.5986936Z 2025-05-07T20:25:37.5986941Z 2025-05-07T20:25:37.5987227Z 2025-05-07T20:25:37.5987235Z 2025-05-07T20:25:37.5987239Z 2025-05-07T20:25:37.6515328Z libcusolver-11.7.1.2 | 95.8 MB | ######3 | 64%  2025-05-07T20:25:37.6516062Z 2025-05-07T20:25:37.6516068Z 2025-05-07T20:25:37.6516073Z 2025-05-07T20:25:37.6516079Z 2025-05-07T20:25:37.6516084Z 2025-05-07T20:25:37.6516089Z 2025-05-07T20:25:37.6519512Z 2025-05-07T20:25:37.6946813Z libnpp-12.3.1.54 | 93.4 MB | ###1 | 32%  2025-05-07T20:25:37.6947150Z 2025-05-07T20:25:37.6947154Z 2025-05-07T20:25:37.6947158Z 2025-05-07T20:25:37.6947162Z 2025-05-07T20:25:37.6952936Z 2025-05-07T20:25:37.6982927Z cuda-nvvp-12.6.80 | 109.3 MB | ########4 | 84%  2025-05-07T20:25:37.7009806Z nsight-compute-2024. | 443.1 MB | #######4 | 75% 2025-05-07T20:25:37.7010067Z 2025-05-07T20:25:37.7010071Z 2025-05-07T20:25:37.7010074Z 2025-05-07T20:25:37.7010078Z 2025-05-07T20:25:37.7010082Z 2025-05-07T20:25:37.7012008Z 2025-05-07T20:25:37.7517939Z libcusolver-11.7.1.2 | 95.8 MB | ######6 | 66%  2025-05-07T20:25:37.7518243Z 2025-05-07T20:25:37.7518247Z 2025-05-07T20:25:37.7518251Z 2025-05-07T20:25:37.7518264Z 2025-05-07T20:25:37.7518268Z 2025-05-07T20:25:37.7518272Z 2025-05-07T20:25:37.7522466Z 2025-05-07T20:25:37.7949404Z libnpp-12.3.1.54 | 93.4 MB | ###4 | 35%  2025-05-07T20:25:37.7949686Z 2025-05-07T20:25:37.7949690Z 2025-05-07T20:25:37.7949694Z 2025-05-07T20:25:37.7949698Z 2025-05-07T20:25:37.7953377Z 2025-05-07T20:25:37.7997426Z cuda-nvvp-12.6.80 | 109.3 MB | ########6 | 87%  2025-05-07T20:25:37.8013142Z nsight-compute-2024. | 443.1 MB | #######5 | 75% 2025-05-07T20:25:37.8013401Z 2025-05-07T20:25:37.8013405Z 2025-05-07T20:25:37.8013408Z 2025-05-07T20:25:37.8013412Z 2025-05-07T20:25:37.8013415Z 2025-05-07T20:25:37.8017473Z 2025-05-07T20:25:37.8736625Z libcusolver-11.7.1.2 | 95.8 MB | ######9 | 69%  2025-05-07T20:25:37.8736953Z 2025-05-07T20:25:37.8736957Z 2025-05-07T20:25:37.8736961Z 2025-05-07T20:25:37.8736964Z 2025-05-07T20:25:37.8736968Z 2025-05-07T20:25:37.8736972Z 2025-05-07T20:25:37.8742132Z 2025-05-07T20:25:37.8951495Z libnpp-12.3.1.54 | 93.4 MB | ###7 | 38%  2025-05-07T20:25:37.8951803Z 2025-05-07T20:25:37.8951807Z 2025-05-07T20:25:37.8951810Z 2025-05-07T20:25:37.8951814Z 2025-05-07T20:25:37.8954361Z 2025-05-07T20:25:37.9015636Z cuda-nvvp-12.6.80 | 109.3 MB | ########8 | 89%  2025-05-07T20:25:37.9015920Z 2025-05-07T20:25:37.9015924Z 2025-05-07T20:25:37.9015928Z 2025-05-07T20:25:37.9015931Z 2025-05-07T20:25:37.9015935Z 2025-05-07T20:25:37.9017770Z 2025-05-07T20:25:37.9025615Z libcusolver-11.7.1.2 | 95.8 MB | #######1 | 72%  2025-05-07T20:25:37.9797464Z nsight-compute-2024. 
| 443.1 MB | #######5 | 76% 2025-05-07T20:25:37.9797736Z 2025-05-07T20:25:37.9797740Z 2025-05-07T20:25:37.9797743Z 2025-05-07T20:25:37.9797763Z 2025-05-07T20:25:37.9797767Z 2025-05-07T20:25:37.9797771Z 2025-05-07T20:25:37.9797774Z 2025-05-07T20:25:37.9956411Z libnpp-12.3.1.54 | 93.4 MB | #### | 40%  2025-05-07T20:25:37.9956714Z 2025-05-07T20:25:37.9956718Z 2025-05-07T20:25:37.9956722Z 2025-05-07T20:25:37.9956726Z 2025-05-07T20:25:37.9959423Z 2025-05-07T20:25:38.0031801Z cuda-nvvp-12.6.80 | 109.3 MB | #########1 | 91%  2025-05-07T20:25:38.0032230Z 2025-05-07T20:25:38.0032236Z 2025-05-07T20:25:38.0032242Z 2025-05-07T20:25:38.0032247Z 2025-05-07T20:25:38.0032252Z 2025-05-07T20:25:38.0032258Z 2025-05-07T20:25:38.0134133Z libcusolver-11.7.1.2 | 95.8 MB | #######4 | 74%  2025-05-07T20:25:38.0801156Z nsight-compute-2024. | 443.1 MB | #######6 | 76% 2025-05-07T20:25:38.0801586Z 2025-05-07T20:25:38.0801593Z 2025-05-07T20:25:38.0801598Z 2025-05-07T20:25:38.0801603Z 2025-05-07T20:25:38.0801608Z 2025-05-07T20:25:38.0801613Z 2025-05-07T20:25:38.0801618Z 2025-05-07T20:25:38.1034777Z libnpp-12.3.1.54 | 93.4 MB | ####3 | 43%  2025-05-07T20:25:38.1035086Z 2025-05-07T20:25:38.1035089Z 2025-05-07T20:25:38.1035093Z 2025-05-07T20:25:38.1035628Z 2025-05-07T20:25:38.1035634Z 2025-05-07T20:25:38.1038942Z 2025-05-07T20:25:38.1049390Z libcusolver-11.7.1.2 | 95.8 MB | #######7 | 77%  2025-05-07T20:25:38.1049693Z 2025-05-07T20:25:38.1049697Z 2025-05-07T20:25:38.1049700Z 2025-05-07T20:25:38.1049704Z 2025-05-07T20:25:38.1055923Z 2025-05-07T20:25:38.1231988Z cuda-nvvp-12.6.80 | 109.3 MB | #########3 | 94%  2025-05-07T20:25:38.1889110Z nsight-compute-2024. | 443.1 MB | #######6 | 77% 2025-05-07T20:25:38.1889446Z 2025-05-07T20:25:38.1889452Z 2025-05-07T20:25:38.1889457Z 2025-05-07T20:25:38.1889462Z 2025-05-07T20:25:38.1889468Z 2025-05-07T20:25:38.1889473Z 2025-05-07T20:25:38.1889478Z 2025-05-07T20:25:38.2090330Z libnpp-12.3.1.54 | 93.4 MB | ####5 | 46%  2025-05-07T20:25:38.2090617Z 2025-05-07T20:25:38.2090621Z 2025-05-07T20:25:38.2090625Z 2025-05-07T20:25:38.2090628Z 2025-05-07T20:25:38.2091442Z 2025-05-07T20:25:38.2107381Z cuda-nvvp-12.6.80 | 109.3 MB | #########5 | 96%  2025-05-07T20:25:38.2107667Z 2025-05-07T20:25:38.2107670Z 2025-05-07T20:25:38.2107674Z 2025-05-07T20:25:38.2116818Z 2025-05-07T20:25:38.2127063Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:38.2127340Z 2025-05-07T20:25:38.2127344Z 2025-05-07T20:25:38.2127347Z 2025-05-07T20:25:38.2127351Z 2025-05-07T20:25:38.2127355Z 2025-05-07T20:25:38.2128660Z 2025-05-07T20:25:38.2282031Z libcusolver-11.7.1.2 | 95.8 MB | #######9 | 80%  2025-05-07T20:25:38.2966573Z nsight-compute-2024. | 443.1 MB | #######7 | 78% 2025-05-07T20:25:38.2966940Z 2025-05-07T20:25:38.2966946Z 2025-05-07T20:25:38.2966961Z 2025-05-07T20:25:38.2966966Z 2025-05-07T20:25:38.2966972Z 2025-05-07T20:25:38.2966977Z 2025-05-07T20:25:38.2967000Z 2025-05-07T20:25:38.3095665Z libnpp-12.3.1.54 | 93.4 MB | ####8 | 49%  2025-05-07T20:25:38.3096062Z 2025-05-07T20:25:38.3096067Z 2025-05-07T20:25:38.3096086Z 2025-05-07T20:25:38.3096091Z 2025-05-07T20:25:38.3096097Z 2025-05-07T20:25:38.3131342Z cuda-nvvp-12.6.80 | 109.3 MB | #########8 | 99%  2025-05-07T20:25:38.3131716Z 2025-05-07T20:25:38.3131720Z 2025-05-07T20:25:38.3131724Z 2025-05-07T20:25:38.3131727Z 2025-05-07T20:25:38.3131731Z 2025-05-07T20:25:38.3132716Z 2025-05-07T20:25:38.3285548Z libcusolver-11.7.1.2 | 95.8 MB | ########2 | 83%  2025-05-07T20:25:38.3967370Z nsight-compute-2024. 
| 443.1 MB | #######8 | 78% 2025-05-07T20:25:38.3967697Z 2025-05-07T20:25:38.3967703Z 2025-05-07T20:25:38.3967708Z 2025-05-07T20:25:38.3967722Z 2025-05-07T20:25:38.3967727Z 2025-05-07T20:25:38.3967732Z 2025-05-07T20:25:38.3967737Z 2025-05-07T20:25:38.4131453Z libnpp-12.3.1.54 | 93.4 MB | #####2 | 52%  2025-05-07T20:25:38.4131813Z 2025-05-07T20:25:38.4131825Z 2025-05-07T20:25:38.4131829Z 2025-05-07T20:25:38.4131833Z 2025-05-07T20:25:38.4131836Z 2025-05-07T20:25:38.4133296Z 2025-05-07T20:25:38.4288666Z libcusolver-11.7.1.2 | 95.8 MB | ########5 | 86%  2025-05-07T20:25:38.4968436Z nsight-compute-2024. | 443.1 MB | #######8 | 79% 2025-05-07T20:25:38.4968827Z 2025-05-07T20:25:38.4968833Z 2025-05-07T20:25:38.4968838Z 2025-05-07T20:25:38.4968853Z 2025-05-07T20:25:38.4968858Z 2025-05-07T20:25:38.4968863Z 2025-05-07T20:25:38.4971725Z 2025-05-07T20:25:38.5135423Z libnpp-12.3.1.54 | 93.4 MB | #####5 | 55%  2025-05-07T20:25:38.5135725Z 2025-05-07T20:25:38.5135729Z 2025-05-07T20:25:38.5135733Z 2025-05-07T20:25:38.5135736Z 2025-05-07T20:25:38.5135740Z 2025-05-07T20:25:38.5135744Z 2025-05-07T20:25:38.5297269Z libcusolver-11.7.1.2 | 95.8 MB | ########9 | 89%  2025-05-07T20:25:38.5977252Z nsight-compute-2024. | 443.1 MB | #######9 | 79% 2025-05-07T20:25:38.5977668Z 2025-05-07T20:25:38.5977674Z 2025-05-07T20:25:38.5977679Z 2025-05-07T20:25:38.5977684Z 2025-05-07T20:25:38.5977698Z 2025-05-07T20:25:38.5977857Z 2025-05-07T20:25:38.5980343Z 2025-05-07T20:25:38.6136294Z libnpp-12.3.1.54 | 93.4 MB | #####8 | 59%  2025-05-07T20:25:38.6136648Z 2025-05-07T20:25:38.6136662Z 2025-05-07T20:25:38.6136668Z 2025-05-07T20:25:38.6136673Z 2025-05-07T20:25:38.6136678Z 2025-05-07T20:25:38.6140282Z 2025-05-07T20:25:38.6299991Z libcusolver-11.7.1.2 | 95.8 MB | #########2 | 93%  2025-05-07T20:25:38.6977585Z nsight-compute-2024. | 443.1 MB | ######## | 80% 2025-05-07T20:25:38.6977912Z 2025-05-07T20:25:38.6977916Z 2025-05-07T20:25:38.6977919Z 2025-05-07T20:25:38.6977923Z 2025-05-07T20:25:38.6977927Z 2025-05-07T20:25:38.6977940Z 2025-05-07T20:25:38.6977944Z 2025-05-07T20:25:38.7146777Z libnpp-12.3.1.54 | 93.4 MB | ######2 | 63%  2025-05-07T20:25:38.7147139Z 2025-05-07T20:25:38.7147143Z 2025-05-07T20:25:38.7147146Z 2025-05-07T20:25:38.7147157Z 2025-05-07T20:25:38.7147161Z 2025-05-07T20:25:38.7147801Z 2025-05-07T20:25:38.7304919Z libcusolver-11.7.1.2 | 95.8 MB | #########6 | 96%  2025-05-07T20:25:38.8064674Z nsight-compute-2024. | 443.1 MB | ######## | 81% 2025-05-07T20:25:38.8065007Z 2025-05-07T20:25:38.8065011Z 2025-05-07T20:25:38.8065015Z 2025-05-07T20:25:38.8065018Z 2025-05-07T20:25:38.8065022Z 2025-05-07T20:25:38.8065025Z 2025-05-07T20:25:38.8068138Z 2025-05-07T20:25:38.8149583Z libnpp-12.3.1.54 | 93.4 MB | ######5 | 66%  2025-05-07T20:25:38.8149877Z 2025-05-07T20:25:38.8149881Z 2025-05-07T20:25:38.8149884Z 2025-05-07T20:25:38.8149888Z 2025-05-07T20:25:38.8149899Z 2025-05-07T20:25:38.8150642Z 2025-05-07T20:25:38.8305464Z libcusolver-11.7.1.2 | 95.8 MB | #########9 | 100%  2025-05-07T20:25:38.9066977Z nsight-compute-2024. | 443.1 MB | ########1 | 82% 2025-05-07T20:25:38.9067331Z 2025-05-07T20:25:38.9067337Z 2025-05-07T20:25:38.9067342Z 2025-05-07T20:25:38.9067347Z 2025-05-07T20:25:38.9067352Z 2025-05-07T20:25:38.9067370Z 2025-05-07T20:25:38.9070497Z 2025-05-07T20:25:38.9306390Z libnpp-12.3.1.54 | 93.4 MB | ######9 | 69%  2025-05-07T20:25:39.0121377Z nsight-compute-2024. 
| 443.1 MB | ########2 | 82% 2025-05-07T20:25:39.0121714Z 2025-05-07T20:25:39.0121720Z 2025-05-07T20:25:39.0121725Z 2025-05-07T20:25:39.0121730Z 2025-05-07T20:25:39.0121735Z 2025-05-07T20:25:39.0121741Z 2025-05-07T20:25:39.0126222Z 2025-05-07T20:25:39.0312883Z libnpp-12.3.1.54 | 93.4 MB | #######2 | 72%  2025-05-07T20:25:39.1167893Z nsight-compute-2024. | 443.1 MB | ########3 | 83% 2025-05-07T20:25:39.1168212Z 2025-05-07T20:25:39.1168216Z 2025-05-07T20:25:39.1168220Z 2025-05-07T20:25:39.1168224Z 2025-05-07T20:25:39.1168227Z 2025-05-07T20:25:39.1168231Z 2025-05-07T20:25:39.1168261Z 2025-05-07T20:25:39.1418304Z libnpp-12.3.1.54 | 93.4 MB | #######5 | 76%  2025-05-07T20:25:39.2170042Z nsight-compute-2024. | 443.1 MB | ########3 | 84% 2025-05-07T20:25:39.2170345Z 2025-05-07T20:25:39.2170349Z 2025-05-07T20:25:39.2170353Z 2025-05-07T20:25:39.2170356Z 2025-05-07T20:25:39.2170360Z 2025-05-07T20:25:39.2170364Z 2025-05-07T20:25:39.2170368Z 2025-05-07T20:25:39.2445317Z libnpp-12.3.1.54 | 93.4 MB | #######8 | 79%  2025-05-07T20:25:39.3175997Z nsight-compute-2024. | 443.1 MB | ########4 | 85% 2025-05-07T20:25:39.3176276Z 2025-05-07T20:25:39.3176281Z 2025-05-07T20:25:39.3176284Z 2025-05-07T20:25:39.3176288Z 2025-05-07T20:25:39.3176292Z 2025-05-07T20:25:39.3176296Z 2025-05-07T20:25:39.3176300Z 2025-05-07T20:25:39.3451582Z libnpp-12.3.1.54 | 93.4 MB | ########1 | 82%  2025-05-07T20:25:39.4451116Z nsight-compute-2024. | 443.1 MB | ########5 | 85% 2025-05-07T20:25:39.4870936Z nsight-compute-2024. | 443.1 MB | ########5 | 86% 2025-05-07T20:25:39.4871555Z 2025-05-07T20:25:39.4871559Z 2025-05-07T20:25:39.4871563Z 2025-05-07T20:25:39.4871567Z 2025-05-07T20:25:39.4871779Z 2025-05-07T20:25:39.4871783Z 2025-05-07T20:25:39.4872966Z 2025-05-07T20:25:39.5456654Z libnpp-12.3.1.54 | 93.4 MB | ########5 | 85%  2025-05-07T20:25:39.5876432Z nsight-compute-2024. | 443.1 MB | ########6 | 87% 2025-05-07T20:25:39.5876804Z 2025-05-07T20:25:39.5876808Z 2025-05-07T20:25:39.5876818Z 2025-05-07T20:25:39.5876822Z 2025-05-07T20:25:39.5876825Z 2025-05-07T20:25:39.5876829Z 2025-05-07T20:25:39.5876833Z 2025-05-07T20:25:39.6521491Z libnpp-12.3.1.54 | 93.4 MB | ########8 | 88%  2025-05-07T20:25:39.6890713Z nsight-compute-2024. | 443.1 MB | ########7 | 87% 2025-05-07T20:25:39.6891177Z 2025-05-07T20:25:39.6891435Z 2025-05-07T20:25:39.6891441Z 2025-05-07T20:25:39.6891447Z 2025-05-07T20:25:39.6891470Z 2025-05-07T20:25:39.6891487Z 2025-05-07T20:25:39.6891671Z 2025-05-07T20:25:39.7557190Z libnpp-12.3.1.54 | 93.4 MB | #########1 | 92%  2025-05-07T20:25:39.7892443Z nsight-compute-2024. | 443.1 MB | ########8 | 88% 2025-05-07T20:25:39.7892725Z 2025-05-07T20:25:39.7892729Z 2025-05-07T20:25:39.7892733Z 2025-05-07T20:25:39.7892737Z 2025-05-07T20:25:39.7892740Z 2025-05-07T20:25:39.7892744Z 2025-05-07T20:25:39.7894163Z 2025-05-07T20:25:39.8558368Z libnpp-12.3.1.54 | 93.4 MB | #########4 | 95%  2025-05-07T20:25:39.8894108Z nsight-compute-2024. | 443.1 MB | ########8 | 89% 2025-05-07T20:25:39.8894421Z 2025-05-07T20:25:39.8894555Z 2025-05-07T20:25:39.8894561Z 2025-05-07T20:25:39.8894566Z 2025-05-07T20:25:39.8894571Z 2025-05-07T20:25:39.8894577Z 2025-05-07T20:25:39.8894604Z 2025-05-07T20:25:39.9560731Z libnpp-12.3.1.54 | 93.4 MB | #########8 | 99%  2025-05-07T20:25:40.0560682Z nsight-compute-2024. | 443.1 MB | ########9 | 90% 2025-05-07T20:25:40.1562236Z nsight-compute-2024. | 443.1 MB | ######### | 91% 2025-05-07T20:25:40.2564375Z nsight-compute-2024. | 443.1 MB | #########1 | 91% 2025-05-07T20:25:40.3568748Z nsight-compute-2024. 
| 443.1 MB | #########2 | 92% 2025-05-07T20:25:40.4573909Z nsight-compute-2024. | 443.1 MB | #########3 | 93% 2025-05-07T20:25:40.5575362Z nsight-compute-2024. | 443.1 MB | #########4 | 94% 2025-05-07T20:25:40.6609152Z nsight-compute-2024. | 443.1 MB | #########5 | 95% 2025-05-07T20:25:40.7628716Z nsight-compute-2024. | 443.1 MB | #########5 | 96% 2025-05-07T20:25:40.8661387Z nsight-compute-2024. | 443.1 MB | #########6 | 97% 2025-05-07T20:25:40.9663007Z nsight-compute-2024. | 443.1 MB | #########7 | 98% 2025-05-07T20:25:41.0760551Z nsight-compute-2024. | 443.1 MB | #########8 | 99% 2025-05-07T20:25:41.6693414Z nsight-compute-2024. | 443.1 MB | #########9 | 100% 2025-05-07T20:25:41.6693810Z 2025-05-07T20:25:41.6693816Z 2025-05-07T20:25:41.6693821Z 2025-05-07T20:25:41.6693826Z 2025-05-07T20:25:41.6693850Z 2025-05-07T20:25:41.6693856Z 2025-05-07T20:25:41.7373114Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%  2025-05-07T20:25:41.7373417Z 2025-05-07T20:25:41.7373436Z 2025-05-07T20:25:41.7373439Z 2025-05-07T20:25:41.7373443Z 2025-05-07T20:25:41.7373446Z 2025-05-07T20:25:41.7373450Z 2025-05-07T20:25:41.7373454Z 2025-05-07T20:25:41.7374932Z 2025-05-07T20:25:41.7902392Z cuda-nvdisasm-12.6.7 | 47.6 MB | | 0%  2025-05-07T20:25:41.7902712Z 2025-05-07T20:25:41.7902716Z 2025-05-07T20:25:41.7902720Z 2025-05-07T20:25:41.7902723Z 2025-05-07T20:25:41.7902727Z 2025-05-07T20:25:41.8373183Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%  2025-05-07T20:25:41.8373578Z 2025-05-07T20:25:41.8373583Z 2025-05-07T20:25:41.8373589Z 2025-05-07T20:25:41.8373594Z 2025-05-07T20:25:41.8373600Z 2025-05-07T20:25:41.8373605Z 2025-05-07T20:25:41.8373626Z 2025-05-07T20:25:41.8373631Z 2025-05-07T20:25:41.8534062Z cuda-nvdisasm-12.6.7 | 47.6 MB | 6 | 6%  2025-05-07T20:25:41.8534366Z 2025-05-07T20:25:41.8534369Z 2025-05-07T20:25:41.8534380Z 2025-05-07T20:25:41.8534383Z 2025-05-07T20:25:41.8534521Z 2025-05-07T20:25:41.8534525Z 2025-05-07T20:25:41.8534528Z 2025-05-07T20:25:41.8534532Z 2025-05-07T20:25:41.8535866Z 2025-05-07T20:25:41.9393644Z libcurand-10.3.7.77 | 39.9 MB | | 0%  2025-05-07T20:25:41.9393975Z 2025-05-07T20:25:41.9393979Z 2025-05-07T20:25:41.9393983Z 2025-05-07T20:25:41.9393986Z 2025-05-07T20:25:41.9393990Z 2025-05-07T20:25:41.9393994Z 2025-05-07T20:25:41.9393998Z 2025-05-07T20:25:41.9394001Z 2025-05-07T20:25:41.9537111Z cuda-nvdisasm-12.6.7 | 47.6 MB | #2 | 13%  2025-05-07T20:25:41.9537420Z 2025-05-07T20:25:41.9537424Z 2025-05-07T20:25:41.9537427Z 2025-05-07T20:25:41.9537431Z 2025-05-07T20:25:41.9537435Z 2025-05-07T20:25:41.9537439Z 2025-05-07T20:25:41.9537449Z 2025-05-07T20:25:41.9537466Z 2025-05-07T20:25:41.9538920Z 2025-05-07T20:25:42.0395239Z libcurand-10.3.7.77 | 39.9 MB | 7 | 8%  2025-05-07T20:25:42.0395552Z 2025-05-07T20:25:42.0395568Z 2025-05-07T20:25:42.0395574Z 2025-05-07T20:25:42.0395579Z 2025-05-07T20:25:42.0395584Z 2025-05-07T20:25:42.0395590Z 2025-05-07T20:25:42.0395595Z 2025-05-07T20:25:42.0400916Z 2025-05-07T20:25:42.0586935Z cuda-nvdisasm-12.6.7 | 47.6 MB | #9 | 19%  2025-05-07T20:25:42.0587380Z 2025-05-07T20:25:42.0587386Z 2025-05-07T20:25:42.0587391Z 2025-05-07T20:25:42.0587396Z 2025-05-07T20:25:42.0587401Z 2025-05-07T20:25:42.0587406Z 2025-05-07T20:25:42.0587411Z 2025-05-07T20:25:42.0587416Z 2025-05-07T20:25:42.0589564Z 2025-05-07T20:25:42.1427434Z libcurand-10.3.7.77 | 39.9 MB | #5 | 15%  2025-05-07T20:25:42.1427750Z 2025-05-07T20:25:42.1427754Z 2025-05-07T20:25:42.1427757Z 2025-05-07T20:25:42.1427761Z 2025-05-07T20:25:42.1427776Z 2025-05-07T20:25:42.1427780Z 2025-05-07T20:25:42.1427784Z 
2025-05-07T20:25:42.9590498Z libnpp-12.3.1.54 | 93.4 MB | ########## | 100%
2025-05-07T20:25:43.8937994Z libcublas-12.6.4.1 | 256.2 MB | ########## | 100%
2025-05-07T20:25:44.0015753Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%
2025-05-07T20:25:44.5301098Z libcurand-10.3.7.77 | 39.9 MB | ########## | 100%
2025-05-07T20:25:44.6708267Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%
2025-05-07T20:25:44.9517184Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########## | 100%
2025-05-07T20:25:45.3770942Z gds-tools-1.11.1.6 | 37.8 MB | ########## | 100%
2025-05-07T20:25:45.4316622Z python-3.10.13 | 24.5 MB | ########## | 100%
2025-05-07T20:25:46.0900471Z cuda-nvcc-tools-12.6 | 23.0 MB | ########## | 100%
2025-05-07T20:25:46.0940252Z cuda-nvrtc-12.6.85 | 17.3 MB | ########## | 100%
2025-05-07T20:25:46.2488385Z cuda-nvcc-dev_linux- | 10.8 MB | ########## | 100%
2025-05-07T20:25:46.4489430Z libnvjitlink-12.6.85 | 14.9 MB | ########## | 100%
2025-05-07T20:25:46.7532511Z cuda-sanitizer-api-1 | 8.9 MB | ########## | 100%
2025-05-07T20:25:46.7601322Z cuda-nvvm-tools-12.6 | 10.4 MB | ########## | 100%
2025-05-07T20:25:48.1054351Z cuda-nvvm-impl-12.6. | 7.7 MB | ########## | 100%
2025-05-07T20:25:49.0332937Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%
2025-05-07T20:25:49.3398915Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%
2025-05-07T20:25:49.7453936Z nsight-compute-2024. | 443.1 MB | ########## | 100%
2025-05-07T20:25:51.6051086Z ... (more hidden) ...
2025-05-07T20:25:57.8053244Z 2025-05-07T20:25:57.8053417Z  2025-05-07T20:25:57.8053589Z 2025-05-07T20:25:57.8053595Z 2025-05-07T20:25:57.8053601Z 2025-05-07T20:25:57.8053783Z  2025-05-07T20:25:57.8053953Z 2025-05-07T20:25:57.8053959Z 2025-05-07T20:25:57.8053964Z 2025-05-07T20:25:57.8053970Z 2025-05-07T20:25:57.8054153Z  2025-05-07T20:25:57.8054354Z 2025-05-07T20:25:57.8054360Z 2025-05-07T20:25:57.8054366Z 2025-05-07T20:25:57.8054371Z 2025-05-07T20:25:57.8054395Z 2025-05-07T20:25:57.8054580Z  2025-05-07T20:25:57.8054801Z 2025-05-07T20:25:57.8054807Z 2025-05-07T20:25:57.8054812Z 2025-05-07T20:25:57.8054817Z 2025-05-07T20:25:57.8054822Z 2025-05-07T20:25:57.8054828Z 2025-05-07T20:25:57.8055014Z  2025-05-07T20:25:57.8055238Z 2025-05-07T20:25:57.8055244Z 2025-05-07T20:25:57.8055249Z 2025-05-07T20:25:57.8055254Z 2025-05-07T20:25:57.8055259Z 2025-05-07T20:25:57.8055264Z 2025-05-07T20:25:57.8055269Z 2025-05-07T20:25:57.8055460Z  2025-05-07T20:25:57.8055937Z 2025-05-07T20:25:57.8055943Z 2025-05-07T20:25:57.8055948Z 2025-05-07T20:25:57.8055953Z 2025-05-07T20:25:57.8055958Z 2025-05-07T20:25:57.8055963Z 2025-05-07T20:25:57.8055968Z 2025-05-07T20:25:57.8055973Z 2025-05-07T20:25:57.8056187Z  2025-05-07T20:25:57.8056484Z 2025-05-07T20:25:57.8056489Z 2025-05-07T20:25:57.8056716Z 2025-05-07T20:25:57.8056724Z 2025-05-07T20:25:57.8056730Z 2025-05-07T20:25:57.8056736Z 2025-05-07T20:25:57.8056741Z 2025-05-07T20:25:57.8056747Z 2025-05-07T20:25:57.8056891Z 2025-05-07T20:25:57.8057119Z  2025-05-07T20:25:57.8057403Z 2025-05-07T20:25:57.8057409Z 2025-05-07T20:25:57.8057415Z 2025-05-07T20:25:57.8057421Z 2025-05-07T20:25:57.8057427Z 2025-05-07T20:25:57.8057432Z 2025-05-07T20:25:57.8057438Z 2025-05-07T20:25:57.8057443Z 2025-05-07T20:25:57.8057448Z 2025-05-07T20:25:57.8057453Z 2025-05-07T20:25:57.8057676Z  2025-05-07T20:25:57.8057950Z 2025-05-07T20:25:57.8057956Z 2025-05-07T20:25:57.8057961Z 2025-05-07T20:25:57.8057966Z 2025-05-07T20:25:57.8057972Z 2025-05-07T20:25:57.8057977Z 2025-05-07T20:25:57.8057982Z 2025-05-07T20:25:57.8057987Z 2025-05-07T20:25:57.8057992Z 2025-05-07T20:25:57.8057998Z 2025-05-07T20:25:57.8058004Z 2025-05-07T20:25:57.8058394Z  2025-05-07T20:25:57.8058685Z 2025-05-07T20:25:57.8058691Z 2025-05-07T20:25:57.8058697Z 2025-05-07T20:25:57.8058714Z 2025-05-07T20:25:57.8058720Z 2025-05-07T20:25:57.8058726Z 2025-05-07T20:25:57.8058739Z 2025-05-07T20:25:57.8058745Z 2025-05-07T20:25:57.8058751Z 2025-05-07T20:25:57.8058756Z 2025-05-07T20:25:57.8058762Z 2025-05-07T20:25:57.8058768Z 2025-05-07T20:25:57.8058988Z  2025-05-07T20:25:57.8059304Z 2025-05-07T20:25:57.8059309Z 2025-05-07T20:25:57.8059314Z 2025-05-07T20:25:57.8059320Z 2025-05-07T20:25:57.8059325Z 2025-05-07T20:25:57.8059330Z 2025-05-07T20:25:57.8059335Z 2025-05-07T20:25:57.8059340Z 2025-05-07T20:25:57.8059345Z 2025-05-07T20:25:57.8059351Z 2025-05-07T20:25:57.8059356Z 2025-05-07T20:25:57.8059361Z 2025-05-07T20:25:57.8059367Z 2025-05-07T20:25:57.8059589Z  2025-05-07T20:25:57.8059910Z 2025-05-07T20:25:57.8059916Z 2025-05-07T20:25:57.8059922Z 2025-05-07T20:25:57.8059927Z 2025-05-07T20:25:57.8059941Z 2025-05-07T20:25:57.8059946Z 2025-05-07T20:25:57.8059951Z 2025-05-07T20:25:57.8059957Z 2025-05-07T20:25:57.8059963Z 2025-05-07T20:25:57.8059969Z 2025-05-07T20:25:57.8059974Z 2025-05-07T20:25:57.8059983Z 2025-05-07T20:25:57.8059988Z 2025-05-07T20:25:57.8059994Z 2025-05-07T20:25:57.8060238Z  2025-05-07T20:25:57.8060543Z 2025-05-07T20:25:57.8060549Z 2025-05-07T20:25:57.8060554Z 2025-05-07T20:25:57.8060560Z 2025-05-07T20:25:57.8060565Z 2025-05-07T20:25:57.8060570Z 
2025-05-07T20:25:57.8060576Z 2025-05-07T20:25:57.8060581Z 2025-05-07T20:25:57.8060587Z 2025-05-07T20:25:57.8060592Z 2025-05-07T20:25:57.8060607Z 2025-05-07T20:25:57.8060612Z 2025-05-07T20:25:57.8060618Z 2025-05-07T20:25:57.8060624Z 2025-05-07T20:25:57.8060629Z 2025-05-07T20:25:57.8060879Z  2025-05-07T20:25:57.8061197Z 2025-05-07T20:25:57.8061203Z 2025-05-07T20:25:57.8061209Z 2025-05-07T20:25:57.8061223Z 2025-05-07T20:25:57.8061229Z 2025-05-07T20:25:57.8061243Z 2025-05-07T20:25:57.8061249Z 2025-05-07T20:25:57.8061255Z 2025-05-07T20:25:57.8061261Z 2025-05-07T20:25:57.8061267Z 2025-05-07T20:25:57.8061273Z 2025-05-07T20:25:57.8061284Z 2025-05-07T20:25:57.8061290Z 2025-05-07T20:25:57.8061296Z 2025-05-07T20:25:57.8061301Z 2025-05-07T20:25:57.8061307Z 2025-05-07T20:25:57.8061551Z  2025-05-07T20:25:57.8061907Z 2025-05-07T20:25:57.8061913Z 2025-05-07T20:25:57.8061919Z 2025-05-07T20:25:57.8061924Z 2025-05-07T20:25:57.8061930Z 2025-05-07T20:25:57.8061936Z 2025-05-07T20:25:57.8061942Z 2025-05-07T20:25:57.8061947Z 2025-05-07T20:25:57.8061952Z 2025-05-07T20:25:57.8061957Z 2025-05-07T20:25:57.8061963Z 2025-05-07T20:25:57.8061969Z 2025-05-07T20:25:57.8061974Z 2025-05-07T20:25:57.8061980Z 2025-05-07T20:25:57.8061985Z 2025-05-07T20:25:57.8061990Z 2025-05-07T20:25:57.8061995Z 2025-05-07T20:25:57.8062269Z  2025-05-07T20:25:57.8062746Z 2025-05-07T20:25:57.8062754Z 2025-05-07T20:25:57.8062759Z 2025-05-07T20:25:57.8062764Z 2025-05-07T20:25:57.8062769Z 2025-05-07T20:25:57.8062774Z 2025-05-07T20:25:57.8062779Z 2025-05-07T20:25:57.8062874Z 2025-05-07T20:25:57.8062891Z 2025-05-07T20:25:57.8062896Z 2025-05-07T20:25:57.8062901Z 2025-05-07T20:25:57.8062906Z 2025-05-07T20:25:57.8062911Z 2025-05-07T20:25:57.8062916Z 2025-05-07T20:25:57.8062921Z 2025-05-07T20:25:57.8062926Z 2025-05-07T20:25:57.8062932Z 2025-05-07T20:25:57.8062937Z 2025-05-07T20:25:57.8063262Z  2025-05-07T20:25:57.8063612Z 2025-05-07T20:25:57.8063618Z 2025-05-07T20:25:57.8063780Z  2025-05-07T20:25:57.8063968Z 2025-05-07T20:25:57.8063974Z 2025-05-07T20:25:57.8064138Z  2025-05-07T20:25:57.8064319Z 2025-05-07T20:25:57.8064325Z 2025-05-07T20:25:57.8064343Z 2025-05-07T20:25:57.8064524Z  2025-05-07T20:25:57.8064707Z 2025-05-07T20:25:57.8064713Z 2025-05-07T20:25:57.8064719Z 2025-05-07T20:25:57.8064734Z 2025-05-07T20:25:57.8064914Z  2025-05-07T20:25:57.8065111Z 2025-05-07T20:25:57.8065117Z 2025-05-07T20:25:57.8065123Z 2025-05-07T20:25:57.8065128Z 2025-05-07T20:25:57.8065142Z 2025-05-07T20:25:57.8065329Z  2025-05-07T20:25:57.8065538Z 2025-05-07T20:25:57.8065543Z 2025-05-07T20:25:57.8065548Z 2025-05-07T20:25:57.8065553Z 2025-05-07T20:25:57.8065559Z 2025-05-07T20:25:57.8065564Z 2025-05-07T20:25:57.8065758Z  2025-05-07T20:25:57.8065966Z 2025-05-07T20:25:57.8065972Z 2025-05-07T20:25:57.8065978Z 2025-05-07T20:25:57.8065983Z 2025-05-07T20:25:57.8065989Z 2025-05-07T20:25:57.8065994Z 2025-05-07T20:25:57.8066000Z 2025-05-07T20:25:57.8066197Z  2025-05-07T20:25:57.8066429Z 2025-05-07T20:25:57.8066436Z 2025-05-07T20:25:57.8066443Z 2025-05-07T20:25:57.8066448Z 2025-05-07T20:25:57.8066454Z 2025-05-07T20:25:57.8066460Z 2025-05-07T20:25:57.8066466Z 2025-05-07T20:25:57.8066471Z 2025-05-07T20:25:57.8066710Z  2025-05-07T20:25:57.8066980Z 2025-05-07T20:25:57.8066986Z 2025-05-07T20:25:57.8066991Z 2025-05-07T20:25:57.8066996Z 2025-05-07T20:25:57.8067001Z 2025-05-07T20:25:57.8067006Z 2025-05-07T20:25:57.8067018Z 2025-05-07T20:25:57.8067024Z 2025-05-07T20:25:57.8067029Z 2025-05-07T20:25:57.8067245Z  2025-05-07T20:25:57.8067495Z 2025-05-07T20:25:57.8067501Z 2025-05-07T20:25:57.8067507Z 
2025-05-07T20:25:57.8067513Z 2025-05-07T20:25:57.8067518Z 2025-05-07T20:25:57.8067524Z 2025-05-07T20:25:57.8067530Z 2025-05-07T20:25:57.8067536Z 2025-05-07T20:25:57.8067541Z 2025-05-07T20:25:57.8067591Z 2025-05-07T20:25:57.8067805Z  2025-05-07T20:25:57.8068085Z 2025-05-07T20:25:57.8068091Z 2025-05-07T20:25:57.8068097Z 2025-05-07T20:25:57.8068102Z 2025-05-07T20:25:57.8068108Z 2025-05-07T20:25:57.8068114Z 2025-05-07T20:25:57.8068119Z 2025-05-07T20:25:57.8068125Z 2025-05-07T20:25:57.8068130Z 2025-05-07T20:25:57.8068136Z 2025-05-07T20:25:57.8068142Z 2025-05-07T20:25:57.8068372Z  2025-05-07T20:25:57.8068667Z 2025-05-07T20:25:57.8068673Z 2025-05-07T20:25:57.8068678Z 2025-05-07T20:25:57.8068683Z 2025-05-07T20:25:57.8068695Z 2025-05-07T20:25:57.8068700Z 2025-05-07T20:25:57.8068705Z 2025-05-07T20:25:57.8068711Z 2025-05-07T20:25:57.8068716Z 2025-05-07T20:25:57.8068721Z 2025-05-07T20:25:57.8068726Z 2025-05-07T20:25:57.8068732Z 2025-05-07T20:25:57.8068976Z  2025-05-07T20:25:57.8069273Z 2025-05-07T20:25:57.8069279Z 2025-05-07T20:25:57.8069285Z 2025-05-07T20:25:57.8069291Z 2025-05-07T20:25:57.8069298Z 2025-05-07T20:25:57.8069303Z 2025-05-07T20:25:57.8069309Z 2025-05-07T20:25:57.8069315Z 2025-05-07T20:25:57.8069320Z 2025-05-07T20:25:57.8069326Z 2025-05-07T20:25:57.8069331Z 2025-05-07T20:25:57.8069336Z 2025-05-07T20:25:57.8069341Z 2025-05-07T20:25:57.8069579Z  2025-05-07T20:25:57.8069880Z 2025-05-07T20:25:57.8069885Z 2025-05-07T20:25:57.8070025Z 2025-05-07T20:25:57.8070032Z 2025-05-07T20:25:57.8070038Z 2025-05-07T20:25:57.8070043Z 2025-05-07T20:25:57.8070049Z 2025-05-07T20:25:57.8070054Z 2025-05-07T20:25:57.8070163Z 2025-05-07T20:25:57.8070169Z 2025-05-07T20:25:57.8070174Z 2025-05-07T20:25:57.8070179Z 2025-05-07T20:25:57.8070184Z 2025-05-07T20:25:57.8070189Z 2025-05-07T20:25:57.8070427Z  2025-05-07T20:25:57.8070719Z 2025-05-07T20:25:57.8070734Z 2025-05-07T20:25:57.8070740Z 2025-05-07T20:25:57.8070745Z 2025-05-07T20:25:57.8070750Z 2025-05-07T20:25:57.8070756Z 2025-05-07T20:25:57.8070762Z 2025-05-07T20:25:57.8070768Z 2025-05-07T20:25:57.8070773Z 2025-05-07T20:25:57.8070779Z 2025-05-07T20:25:57.8070784Z 2025-05-07T20:25:57.8070789Z 2025-05-07T20:25:57.8070795Z 2025-05-07T20:25:57.8070800Z 2025-05-07T20:25:57.8070806Z 2025-05-07T20:25:57.8071035Z  2025-05-07T20:25:57.8071358Z 2025-05-07T20:25:57.8071363Z 2025-05-07T20:25:57.8071377Z 2025-05-07T20:25:57.8071383Z 2025-05-07T20:25:57.8071389Z 2025-05-07T20:25:57.8071394Z 2025-05-07T20:25:57.8071400Z 2025-05-07T20:25:57.8071406Z 2025-05-07T20:25:57.8071412Z 2025-05-07T20:25:57.8071428Z 2025-05-07T20:25:57.8071434Z 2025-05-07T20:25:57.8071439Z 2025-05-07T20:25:57.8071445Z 2025-05-07T20:25:57.8071451Z 2025-05-07T20:25:57.8071457Z 2025-05-07T20:25:57.8071462Z 2025-05-07T20:25:57.8071707Z  2025-05-07T20:25:57.8072032Z 2025-05-07T20:25:57.8072037Z 2025-05-07T20:25:57.8072043Z 2025-05-07T20:25:57.8072048Z 2025-05-07T20:25:57.8072054Z 2025-05-07T20:25:57.8072059Z 2025-05-07T20:25:57.8072065Z 2025-05-07T20:25:57.8072070Z 2025-05-07T20:25:57.8072076Z 2025-05-07T20:25:57.8072090Z 2025-05-07T20:25:57.8072095Z 2025-05-07T20:25:57.8072100Z 2025-05-07T20:25:57.8072106Z 2025-05-07T20:25:57.8072111Z 2025-05-07T20:25:57.8072117Z 2025-05-07T20:25:57.8072122Z 2025-05-07T20:25:57.8072128Z 2025-05-07T20:25:57.8072373Z  2025-05-07T20:25:57.8072703Z 2025-05-07T20:25:57.8072709Z 2025-05-07T20:25:57.8072715Z 2025-05-07T20:25:57.8072720Z 2025-05-07T20:25:57.8072731Z 2025-05-07T20:25:57.8072737Z 2025-05-07T20:25:57.8072743Z 2025-05-07T20:25:57.8072748Z 2025-05-07T20:25:57.8072754Z 
2025-05-07T20:25:57.8077793Z  done
2025-05-07T20:25:58.1238064Z Preparing transaction: done
2025-05-07T20:25:59.5948034Z Verifying transaction: done
2025-05-07T20:26:00.3033021Z Executing transaction: done
2025-05-07T20:26:02.6563680Z [INSTALL] Fixing file placements for CUDA 12.6.3+ ...
2025-05-07T20:26:02.6564247Z [INSTALL] Creating symlinks: libnvToolsExt.so
2025-05-07T20:26:02.6565070Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:02.6579082Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:02.6592087Z [INSTALL] Copying nvtx3 headers ...
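(A note on the symlink step above: CUDA 12.x conda packages ship only the versioned libnvToolsExt.so.1, so the installer recreates the bare .so name that a `-lnvToolsExt` link line expects, in both library directories. A minimal standalone sketch of the same fix, assuming CONDA_PREFIX points at the build env; the loop and existence guard are illustrative, not from the CI script:)

    # Recreate the unversioned libnvToolsExt.so link in both library dirs,
    # mirroring the two ln -sf commands above; skips dirs without the .so.1.
    for libdir in "${CONDA_PREFIX}/lib" "${CONDA_PREFIX}/targets/x86_64-linux/lib"; do
      if [ -f "${libdir}/libnvToolsExt.so.1" ]; then
        ln -sf "${libdir}/libnvToolsExt.so.1" "${libdir}/libnvToolsExt.so"
      fi
    done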
2025-05-07T20:26:02.6597940Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/
2025-05-07T20:26:02.8330159Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/
2025-05-07T20:26:02.8352516Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
2025-05-07T20:26:02.8717479Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:04.7492120Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:04.8114089Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:05.2335346Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:05.2677459Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:05.7001533Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
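(A note on the mechanism used above: `conda env config vars set` stores the variable inside the env itself, so it is exported on every later `conda run` or activation; that is why the earlier `printenv LD_LIBRARY_PATH` failed before the set, while the same probe succeeds for variables checked afterwards. A minimal verification sketch; the env name and value are taken from this log, and the `vars list` step is illustrative:)

    # Persist a variable in the conda env, then confirm it is recorded
    # and exported when the env is entered via conda run.
    conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
    conda env config vars list -n build_binary        # variables stored for the env
    conda run -n build_binary printenv NVML_LIB_PATH  # exported inside the env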
2025-05-07T20:26:05.7002463Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:08.1351778Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:10.1507221Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:12.1716371Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:12.1717203Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:14.1861377Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:16.0768148Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:16.1398510Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:26:19.9740165Z /tmp/tmpv72y_08w: line 3: clang: command not found
2025-05-07T20:26:19.9740973Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:26:20.0371720Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:26:20.0393092Z total 36
2025-05-07T20:26:20.0393385Z drwxr-xr-x. 2 ec2-user ec2-user 191 May 7 20:25 .
2025-05-07T20:26:20.0393776Z drwxr-xr-x. 5 ec2-user ec2-user 62 May 7 20:24 ..
2025-05-07T20:26:20.0394231Z -rw-r--r--. 2 ec2-user ec2-user 3778 Jun 10 2024 activate-binutils_linux-64.sh
2025-05-07T20:26:20.0394768Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10 2024 activate-gcc_linux-64.sh
2025-05-07T20:26:20.0395249Z -rw-r--r--. 2 ec2-user ec2-user 5190 Jun 10 2024 activate-gxx_linux-64.sh
2025-05-07T20:26:20.0395717Z -rw-r--r--. 2 ec2-user ec2-user 136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:26:20.0396176Z -rw-r--r--. 2 ec2-user ec2-user 872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:26:20.0396623Z -rw-r--r--. 2 ec2-user ec2-user 2932 Nov 20 20:32 ~cuda-nvcc_activate.sh
2025-05-07T20:26:20.0397142Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
2025-05-07T20:26:20.0397784Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:26:20.0420388Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:26:21.9910026Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:26:21.9911161Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:26:22.4135993Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:26:24.2916314Z -allow-unsupported-compiler
2025-05-07T20:26:24.3532404Z [INFO] Printing out all preprocessor defines in nvcc ...
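(The dump that follows comes from feeding an empty CUDA translation unit to nvcc with -dM -E, which prints every macro defined by the end of preprocessing: the host compiler's built-ins plus the CUDA headers nvcc pre-includes. Grepping that output is a quick way to confirm toolkit-version macros; a hedged sketch, with an illustrative filter pattern:)

    # Print all predefined macros for an empty .cu input, then keep only
    # the CUDA version macros that appear later in this dump.
    conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null \
      | grep -E '__CUDACC_VER|__CUDA_API_VER'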
2025-05-07T20:26:24.3532920Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null 2025-05-07T20:26:24.3533244Z 2025-05-07T20:26:26.3049198Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead"))) 2025-05-07T20:26:26.3049962Z #define M_PIl 3.141592653589793238462643383279502884L 2025-05-07T20:26:26.3050377Z #define _IO_CURRENTLY_PUTTING 0x800 2025-05-07T20:26:26.3050714Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig)) 2025-05-07T20:26:26.3051082Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:26:26.3051350Z #define _STL_PAIR_H 1 2025-05-07T20:26:26.3051643Z #define __cpp_attributes 200809L 2025-05-07T20:26:26.3051980Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:26:26.3052351Z #define __DELETE_THROW throw() 2025-05-07T20:26:26.3052610Z #define _PTRDIFF_T_ 2025-05-07T20:26:26.3052867Z #define M_PI_4 0.78539816339744830962 2025-05-07T20:26:26.3053267Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:26:26.3053564Z #define _IO_LEFT 02 2025-05-07T20:26:26.3053836Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:26:26.3054105Z #define _POSIX2_BC_SCALE_MAX 99 2025-05-07T20:26:26.3054389Z #define _GLIBCXX_USE_RANDOM_TR1 1 2025-05-07T20:26:26.3054816Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp) 2025-05-07T20:26:26.3055253Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:26:26.3055809Z #define RE_DUP_MAX (0x7fff) 2025-05-07T20:26:26.3056175Z #define _IOS_OUTPUT 2 2025-05-07T20:26:26.3056628Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:26:26.3057215Z #define toascii_l(c,l) __toascii_l ((c), (l)) 2025-05-07T20:26:26.3057653Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:26:26.3058166Z #define _GLIBCXX_USE_FCHMOD 1 2025-05-07T20:26:26.3058567Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:26:26.3059604Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; })) 2025-05-07T20:26:26.3061014Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:26:26.3061376Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:26:26.3061687Z #define cudaTextureTypeCubemapLayered 0xFC 2025-05-07T20:26:26.3062096Z #define _T_WCHAR_ 2025-05-07T20:26:26.3062413Z #define stdout stdout 2025-05-07T20:26:26.3062823Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11"))) 2025-05-07T20:26:26.3063370Z #define CHAR_BIT __CHAR_BIT__ 2025-05-07T20:26:26.3063738Z #define __flexarr [] 2025-05-07T20:26:26.3064128Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:26:26.3064585Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:26:26.3065034Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:26:26.3065398Z #define _MATH_H 1 2025-05-07T20:26:26.3065809Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:26:26.3066315Z #define __S64_TYPE long int 2025-05-07T20:26:26.3066675Z #define __stub_fchflags 2025-05-07T20:26:26.3067045Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:26:26.3067451Z #define __SQUAD_TYPE long int 2025-05-07T20:26:26.3067822Z #define __INTMAX_C(c) c ## L 2025-05-07T20:26:26.3068191Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:26:26.3068553Z #define NL_NMAX INT_MAX 2025-05-07T20:26:26.3068877Z #define _BITS_TIME_H 1 2025-05-07T20:26:26.3069261Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:26:26.3069718Z #define 
_GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:26:26.3070531Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:26:26.3071049Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:26:26.3071609Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:26:26.3072180Z #define __CHAR_BIT__ 8 2025-05-07T20:26:26.3072446Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:26.3072773Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:26:26.3073075Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:26:26.3073341Z #define FP_NAN 0 2025-05-07T20:26:26.3073614Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:26:26.3074076Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:26:26.3074580Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:26:26.3074977Z #define __cudaCDP2GetErrorString 2025-05-07T20:26:26.3075274Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:26:26.3075538Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:26:26.3075801Z #define __SM_80_RT_H__ 2025-05-07T20:26:26.3076043Z #define _NEW 2025-05-07T20:26:26.3076279Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:26:26.3076562Z #define __UINT8_MAX__ 0xff 2025-05-07T20:26:26.3076939Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:26:26.3077369Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:26:26.3077612Z #define __USE_ANSI 1 2025-05-07T20:26:26.3077912Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:26:26.3078330Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:26:26.3078825Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:26:26.3079234Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:26:26.3079632Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:26:26.3079950Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:26:26.3080238Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:26:26.3080525Z #define PIPE_BUF 4096 2025-05-07T20:26:26.3080847Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:26:26.3081223Z #define ADJ_TICK 0x4000 2025-05-07T20:26:26.3081512Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:26:26.3081835Z #define MQ_PRIO_MAX 32768 2025-05-07T20:26:26.3082101Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:26:26.3082439Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:26:26.3082905Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:26.3083433Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:26:26.3083857Z #define _XOPEN_SOURCE 700 2025-05-07T20:26:26.3084119Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:26:26.3084395Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:26:26.3084691Z #define __cpp_static_assert 201411L 2025-05-07T20:26:26.3085034Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:26:26.3085384Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:26:26.3085665Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:26:26.3085952Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:26:26.3086264Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:26:26.3086546Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:26:26.3086854Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:26.3087223Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:26:26.3087565Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:26:26.3087854Z #define 
_GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:26:26.3088180Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:26.3088543Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:26:26.3088904Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:26:26.3089206Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:26:26.3089509Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:26:26.3089843Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:26:26.3090177Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:26:26.3090732Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:26:26.3091148Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:26:26.3091465Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:26:26.3091744Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:26:26.3092113Z #define __GCC_IEC_559 2 2025-05-07T20:26:26.3092416Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:26:26.3092765Z #define _IO_flockfile(_fp) 2025-05-07T20:26:26.3093030Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:26:26.3093308Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:26:26.3093578Z #define _IOFBF 0 2025-05-07T20:26:26.3093797Z #define __USE_BSD 1 2025-05-07T20:26:26.3094077Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:26:26.3094369Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:26:26.3094650Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:26:26.3094906Z #define _IO_NO_WRITES 8 2025-05-07T20:26:26.3095171Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:26:26.3095534Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:26:26.3095891Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:26:26.3096207Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:26:26.3096534Z #define __cpp_binary_literals 201304L 2025-05-07T20:26:26.3096833Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:26:26.3097110Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:26:26.3097406Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:26:26.3097825Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:26:26.3098328Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:26:26.3098705Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:26:26.3099022Z #define M_PI 3.14159265358979323846 2025-05-07T20:26:26.3099332Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:26:26.3099669Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:26:26.3099984Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:26:26.3100295Z #define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:26:26.3100573Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:26:26.3100855Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:26:26.3101453Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? 
EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:26:26.3102050Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:26:26.3102387Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:26:26.3102720Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:26:26.3103023Z #define __cudaCDP2GetErrorName 2025-05-07T20:26:26.3103307Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:26:26.3115291Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:26:26.3115792Z #define __ASSERT_VOID_CAST static_cast 2025-05-07T20:26:26.3116171Z #define __cpp_variadic_templates 200704L 2025-05-07T20:26:26.3116469Z #define RAND_MAX 2147483647 2025-05-07T20:26:26.3116731Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:26:26.3117060Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:26.3117375Z #define __SM_90_RT_H__ 2025-05-07T20:26:26.3117614Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:26:26.3117886Z #define __COMPAR_FN_T 2025-05-07T20:26:26.3118135Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:26:26.3118427Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:26:26.3118978Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:26:26.3119587Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:26:26.3119966Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:26:26.3120362Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:26:26.3120683Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:26:26.3121061Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:26:26.3121415Z #define __cpp_variable_templates 201304L 2025-05-07T20:26:26.3122012Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:26.3122872Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:26:26.3123458Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:26:26.3123744Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:26:26.3124093Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:26:26.3124500Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:26:26.3124773Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:26:26.3125038Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:26:26.3125309Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:26:26.3125560Z #define __u_char_defined 2025-05-07T20:26:26.3125873Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:26:26.3126237Z #define STA_PPSERROR 0x0800 2025-05-07T20:26:26.3126498Z #define _GLIBCXX_STD_A std 2025-05-07T20:26:26.3126748Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:26:26.3127035Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:26:26.3127477Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:26:26.3127903Z #define FP_INFINITE 1 2025-05-07T20:26:26.3128272Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:26.3128690Z #define _IO_pid_t __pid_t 2025-05-07T20:26:26.3128945Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:26:26.3129201Z #define __LEAF , __leaf__ 2025-05-07T20:26:26.3129456Z #define PATH_MAX 4096 2025-05-07T20:26:26.3129714Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:26:26.3130050Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:26:26.3130376Z #define _LIMITS_H___ 2025-05-07T20:26:26.3130605Z #define __size_t 2025-05-07T20:26:26.3130830Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:26:26.3131377Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | 
STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:26:26.3131942Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:26:26.3132254Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:26:26.3132583Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:26:26.3132846Z #define _WCHAR_T_DEFINED 2025-05-07T20:26:26.3133205Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:26:26.3133597Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:26:26.3133895Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:26:26.3134226Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:26:26.3134507Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:26:26.3134790Z #define __INT8_C(c) c 2025-05-07T20:26:26.3135050Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:26:26.3135344Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:26:26.3135609Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:26:26.3135866Z #define __SM_70_RT_HPP__ 2025-05-07T20:26:26.3136113Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:26:26.3136391Z #define __cpp_variadic_using 201611L 2025-05-07T20:26:26.3136723Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:26.3137045Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:26:26.3137316Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:26:26.3137590Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:26:26.3137859Z #define __cpp_capture_star_this 201603L 2025-05-07T20:26:26.3138316Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:26:26.3138627Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:26:26.3138996Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:26:26.3139368Z #define NFDBITS __NFDBITS 2025-05-07T20:26:26.3139626Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:26:26.3139916Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:26:26.3140230Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:26:26.3140546Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:26:26.3140803Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:26:26.3141085Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:26:26.3141388Z #define STA_UNSYNC 0x0040 2025-05-07T20:26:26.3141698Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:26.3142111Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:26:26.3142496Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:26:26.3143032Z #define __cpp_if_constexpr 201606L 2025-05-07T20:26:26.3143361Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:26:26.3143756Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:26:26.3144249Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:26:26.3144565Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:26:26.3144892Z #define __daddr_t_defined 2025-05-07T20:26:26.3145144Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:26:26.3145417Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:26:26.3145736Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:26:26.3146246Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:26:26.3146728Z #define _ACRTIMP 2025-05-07T20:26:26.3146958Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:26:26.3147220Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:26:26.3147508Z #define _IOS_BIN 128 2025-05-07T20:26:26.3147870Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:26:26.3148278Z #define __FLT64X_HAS_QUIET_NAN__ 1 
2025-05-07T20:26:26.3148549Z #define UNDERFLOW 4 2025-05-07T20:26:26.3148775Z #define NAME_MAX 255 2025-05-07T20:26:26.3149005Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:26:26.3149278Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:26:26.3149560Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:26:26.3149857Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:26:26.3150237Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:26:26.3150627Z #define __ptr_t void * 2025-05-07T20:26:26.3150866Z #define M_E 2.7182818284590452354 2025-05-07T20:26:26.3151139Z #define cudaSurfaceType1D 0x01 2025-05-07T20:26:26.3151408Z #define __USE_ISOCXX11 1 2025-05-07T20:26:26.3151674Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:26:26.3151990Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:26:26.3152289Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:26:26.3152572Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:26:26.3152856Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:26:26.3153186Z #define cudaSurfaceType2D 0x02 2025-05-07T20:26:26.3153450Z #define __linux 1 2025-05-07T20:26:26.3153678Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:26:26.3153978Z #define cudaDeviceMask 0xff 2025-05-07T20:26:26.3154276Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:26:26.3154565Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:26:26.3154842Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:26:26.3155127Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:26:26.3155443Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:26:26.3156051Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:26:26.3156347Z #define _BITS_TYPES_H 1 2025-05-07T20:26:26.3156639Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:26:26.3156975Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:26:26.3157283Z #define cudaSurfaceType3D 0x03 2025-05-07T20:26:26.3157571Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:26:26.3157857Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:26:26.3158148Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:26:26.3158939Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:26:26.3159757Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:26:26.3160036Z #define __unix 1 2025-05-07T20:26:26.3160258Z #define MATH_ERRNO 1 2025-05-07T20:26:26.3160503Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:26:26.3160779Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:26:26.3161053Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:26:26.3161341Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:26:26.3161626Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:26:26.3161910Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:26:26.3162629Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:26:26.3163096Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:26:26.3163399Z #define CUDARTAPI_CDECL 2025-05-07T20:26:26.3163790Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:26:26.3164069Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:26:26.3164351Z #define __cpp_lib_void_t 201411 2025-05-07T20:26:26.3164616Z #define _POSIX_AIO_MAX 1 2025-05-07T20:26:26.3164854Z #define __SIZE_T 2025-05-07T20:26:26.3165101Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:26:26.3165424Z #define _GLIBCXX_FULLY_DYNAMIC_STRING 
0 2025-05-07T20:26:26.3165723Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:26:26.3165983Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:26:26.3166251Z #define _ATFILE_SOURCE 1 2025-05-07T20:26:26.3166647Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:26:26.3167078Z #define __WAIT_STATUS void * 2025-05-07T20:26:26.3167352Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:26:26.3167623Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:26:26.3167888Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:26:26.3168179Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:26:26.3168461Z #define __WINT_MIN__ 0U 2025-05-07T20:26:26.3169040Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:26:26.3169680Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:26:26.3169982Z #define WUNTRACED 2 2025-05-07T20:26:26.3170215Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:26:26.3170490Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:26:26.3170778Z #define NZERO 20 2025-05-07T20:26:26.3171010Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:26:26.3171281Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:26:26.3171574Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:26:26.3171868Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:26:26.3172121Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:26:26.3172405Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:26:26.3172677Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:26:26.3172955Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:26:26.3173226Z #define EXIT_FAILURE 1 2025-05-07T20:26:26.3173468Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:26:26.3173758Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:26:26.3174036Z #define _SIZE_T_DEFINED_ 2025-05-07T20:26:26.3174290Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:26:26.3174570Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:26:26.3174904Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:26:26.3175264Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:26:26.3175556Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:26:26.3175806Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:26:26.3176079Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:26:26.3176374Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:26:26.3176679Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:26:26.3176969Z #define SEEK_DATA 3 2025-05-07T20:26:26.3177221Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:26:26.3177511Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:26:26.3177934Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:26:26.3178452Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:26:26.3178706Z #define __INT64_C(c) c ## L 2025-05-07T20:26:26.3178976Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:26:26.3179311Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:26:26.3179637Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:26:26.3179913Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:26:26.3180211Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:26:26.3180515Z #define STA_PPSWANDER 0x0400 2025-05-07T20:26:26.3180772Z #define __INT_WCHAR_T_H 2025-05-07T20:26:26.3181012Z #define WSTOPPED 2 2025-05-07T20:26:26.3181300Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:26:26.3182659Z #define _POSIX_MQ_OPEN_MAX 8 2025-05-07T20:26:26.3182959Z #define FP_NORMAL 4 
2025-05-07T20:26:26.3183209Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:26:26.3183490Z #define _BITS_TIMEX_H 1 2025-05-07T20:26:26.3183839Z #define _POSIX_LINK_MAX 8 2025-05-07T20:26:26.3184127Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:26:26.3184421Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:26:26.3184693Z #define cudaTextureType1D 0x01 2025-05-07T20:26:26.3184967Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:26:26.3185233Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:26:26.3185499Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:26:26.3185797Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:26:26.3186228Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:26:26.3186673Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:26:26.3186945Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:26:26.3187218Z #define _POSIX_SOURCE 1 2025-05-07T20:26:26.3187465Z #define cudaTextureType2D 0x02 2025-05-07T20:26:26.3187738Z #define _PTR_TRAITS_H 1 2025-05-07T20:26:26.3188012Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:26:26.3188323Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:26:26.3188601Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:26:26.3188929Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:26:26.3189272Z #define cudaTextureType3D 0x03 2025-05-07T20:26:26.3189539Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:26:26.3189800Z #define CLOCK_REALTIME 0 2025-05-07T20:26:26.3190050Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:26:26.3190320Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:26:26.3190626Z #define __cpp_aligned_new 201606L 2025-05-07T20:26:26.3190916Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:26:26.3191190Z #define cudaEventBlockingSync 0x01 2025-05-07T20:26:26.3191482Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:26:26.3191755Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:26:26.3192108Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:26:26.3192497Z #define _GLIBCXX_USE_C99_FENV_TR1 1 2025-05-07T20:26:26.3192781Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:26:26.3193027Z #define __GLIBC__ 2 2025-05-07T20:26:26.3193255Z #define __END_DECLS } 2025-05-07T20:26:26.3193497Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:26:26.3193862Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:26:26.3194236Z #define __CONCAT(x,y) x ## y 2025-05-07T20:26:26.3194490Z #define WCONTINUED 8 2025-05-07T20:26:26.3194725Z #define __STDC_HOSTED__ 1 2025-05-07T20:26:26.3194981Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:26:26.3195261Z #define _ALLOCA_H 1 2025-05-07T20:26:26.3195496Z #define __host__ __location__(host) 2025-05-07T20:26:26.3195919Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:26:26.3196359Z #define __SLONG32_TYPE int 2025-05-07T20:26:26.3196636Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 1 2025-05-07T20:26:26.3196926Z #define _SYS_SELECT_H 1 2025-05-07T20:26:26.3197174Z #define _IO_LINE_BUF 0x200 2025-05-07T20:26:26.3197422Z #define _IOS_NOCREATE 32 2025-05-07T20:26:26.3197666Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:26:26.3197955Z #define __cudaGet_warpSize() warpSize 2025-05-07T20:26:26.3198249Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:26:26.3198538Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:26:26.3198816Z #define __global__ __location__(global) 2025-05-07T20:26:26.3199109Z #define __GNU_LIBRARY__ 6 2025-05-07T20:26:26.3199367Z #define 
__cpp_decltype_auto 201304L 2025-05-07T20:26:26.3199640Z #define __DBL_DIG__ 15 2025-05-07T20:26:26.3199869Z #define TIME_UTC 1 2025-05-07T20:26:26.3200087Z #define __FLT32_DIG__ 6 2025-05-07T20:26:26.3200406Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:26:26.3200808Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:26:26.3201127Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:26:26.3201433Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:26:26.3201850Z #define _G_BUFSIZ 8192 2025-05-07T20:26:26.3202159Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:26:26.3202530Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:26:26.3203006Z #define __cudaCDP2GetDevice 2025-05-07T20:26:26.3203373Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:26:26.3203690Z #define STA_CLOCKERR 0x1000 2025-05-07T20:26:26.3203958Z #define __GXX_WEAK__ 1 2025-05-07T20:26:26.3204214Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:26.3204518Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:26:26.3204777Z #define __SHRT_WIDTH__ 16 2025-05-07T20:26:26.3205077Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:26:26.3205418Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:26:26.3205692Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:26:26.3205980Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:26:26.3206277Z #define _G_config_h 1 2025-05-07T20:26:26.3206560Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:26:26.3206899Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:26:26.3207183Z #define _GCC_WCHAR_T 2025-05-07T20:26:26.3207422Z #define TMP_MAX 238328 2025-05-07T20:26:26.3207664Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:26:26.3207939Z #define __DEVICE_TYPES_H__ 2025-05-07T20:26:26.3208212Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:26.3208489Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:26:26.3208773Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:26:26.3209060Z #define _IO_SKIPWS 01 2025-05-07T20:26:26.3209458Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:26:26.3209918Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:26:26.3210187Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:26:26.3210516Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:26:26.3210884Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:26:26.3211260Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:26:26.3211627Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:26:26.3211877Z #define le32toh(x) (x) 2025-05-07T20:26:26.3212118Z #define _SIZE_T_DEFINED 2025-05-07T20:26:26.3212378Z #define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:26:26.3212712Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:26:26.3213063Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:26:26.3213461Z #define __WIFSIGNALED(status) (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:26:26.3213896Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:26:26.3214190Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:26:26.3214455Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:26:26.3214714Z #define _POSIX_NAME_MAX 14 2025-05-07T20:26:26.3215000Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:26:26.3215533Z #define _GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR(_Iter) std::__make_move_if_noexcept_iterator(_Iter) 2025-05-07T20:26:26.3216036Z #define _GLIBCXX_USE_CLOCK_REALTIME 1 2025-05-07T20:26:26.3216351Z 
#define __cpp_enumerator_attributes 201411L 2025-05-07T20:26:26.3216704Z #define __WCOREDUMP(status) ((status) & __WCOREFLAG) 2025-05-07T20:26:26.3217021Z #define _WCHAR_T_ 2025-05-07T20:26:26.3217250Z #define _GLIBCXX_FAST_MATH 0 2025-05-07T20:26:26.3217613Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:26:26.3217996Z #define RTSIG_MAX 32 2025-05-07T20:26:26.3218381Z #define _STDDEF_H 2025-05-07T20:26:26.3218618Z #define CU_UUID_HAS_BEEN_DEFINED 2025-05-07T20:26:26.3218890Z #define _VA_LIST_DEFINED 2025-05-07T20:26:26.3219138Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:26:26.3219478Z #define __glibcxx_requires_non_empty_range(_First,_Last) 2025-05-07T20:26:26.3219870Z #define __grid_constant__ __location__(grid_constant) 2025-05-07T20:26:26.3220200Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:26:26.3220486Z #define _GLIBCXX_BEGIN_EXTERN_C extern "C" { 2025-05-07T20:26:26.3221122Z #define _PSTL_CPP14_INTEGER_SEQUENCE_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L) 2025-05-07T20:26:26.3221651Z #define __glibcxx_digits_b(T,B) (B - __glibcxx_signed_b (T,B)) 2025-05-07T20:26:26.3222014Z #define __SIZEOF_PTHREAD_COND_T 48 2025-05-07T20:26:26.3222422Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC(PRM) 2025-05-07T20:26:26.3222737Z #define __unix__ 1 2025-05-07T20:26:26.3222969Z #define __SM_60_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:26.3223254Z #define __INT_WIDTH__ 32 2025-05-07T20:26:26.3223504Z #define __SIZEOF_LONG__ 8 2025-05-07T20:26:26.3223758Z #define _IONBF 2 2025-05-07T20:26:26.3224237Z #define __MATHCALLX(function,suffix,args,attrib) __MATHDECLX (_Mdouble_,function,suffix, args, attrib) 2025-05-07T20:26:26.3225007Z #define _IO_getc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) ? 
__uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++) 2025-05-07T20:26:26.3225541Z #define __STDC_IEC_559__ 1 2025-05-07T20:26:26.3225794Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:26:26.3226066Z #define __UINT16_C(c) c 2025-05-07T20:26:26.3226314Z #define M_2_PI 0.63661977236758134308 2025-05-07T20:26:26.3226581Z #define STA_DEL 0x0020 2025-05-07T20:26:26.3226824Z #define __CUDACC_VER_MINOR__ 6 2025-05-07T20:26:26.3227081Z #define __id_t_defined 2025-05-07T20:26:26.3227353Z #define w_retcode __wait_terminated.__w_retcode 2025-05-07T20:26:26.3227806Z #define _IO_PENDING_OUTPUT_COUNT(_fp) ((_fp)->_IO_write_ptr - (_fp)->_IO_write_base) 2025-05-07T20:26:26.3228235Z #define _GLIBCXX_HAVE_MODFF 1 2025-05-07T20:26:26.3228503Z #define _GLIBCXX_HAVE_MODFL 1 2025-05-07T20:26:26.3228758Z #define __DECIMAL_DIG__ 21 2025-05-07T20:26:26.3229012Z #define _POSIX2_RE_DUP_MAX 255 2025-05-07T20:26:26.3229275Z #define __USE_FORTIFY_LEVEL 0 2025-05-07T20:26:26.3229538Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:26:26.3229803Z #define SING 2 2025-05-07T20:26:26.3230022Z #define STA_FREQHOLD 0x0080 2025-05-07T20:26:26.3230289Z #define __SM_32_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:26.3230591Z #define cudaStreamDefault 0x00 2025-05-07T20:26:26.3230946Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:26:26.3231312Z #define _GLIBCXX_HAVE_HYPOTL 1 2025-05-07T20:26:26.3231582Z #define _GLIBCXX_HAVE_SYS_UIO_H 1 2025-05-07T20:26:26.3231855Z #define __gnu_linux__ 1 2025-05-07T20:26:26.3232090Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:26:26.3232348Z #define _LARGEFILE_SOURCE 1 2025-05-07T20:26:26.3232602Z #define MAX_INPUT 255 2025-05-07T20:26:26.3232839Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:26:26.3233175Z #define __isalpha_l(c,l) __isctype_l((c), _ISalpha, (l)) 2025-05-07T20:26:26.3233552Z #define __glibcxx_requires_heap(_First,_Last) 2025-05-07T20:26:26.3233870Z #define _GLIBCXX_CPU_DEFINES 1 2025-05-07T20:26:26.3234190Z #define _GLIBCXX_HAVE_POLL_H 1 2025-05-07T20:26:26.3234594Z #define __attribute_warn_unused_result__ __attribute__ ((__warn_unused_result__)) 2025-05-07T20:26:26.3235021Z #define _IO_SHOWPOS 02000 2025-05-07T20:26:26.3235349Z #define _GLIBCXX_HAVE_SYMVER_SYMBOL_RENAMING_RUNTIME_SUPPORT 1 2025-05-07T20:26:26.3235714Z #define _Mfloat_ float 2025-05-07T20:26:26.3235986Z #define __glibcxx_requires_cond(_Cond,_Msg) 2025-05-07T20:26:26.3236294Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:26:26.3236588Z #define DELAYTIMER_MAX 2147483647 2025-05-07T20:26:26.3237077Z #define __glibcxx_max_b(T,B) (__glibcxx_signed_b (T,B) ? 
(((((T)1 << (__glibcxx_digits_b (T,B) - 1)) - 1) << 1) + 1) : ~(T)0) 2025-05-07T20:26:26.3237591Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:26.3245896Z #define _GLIBCXX98_USE_C99_STDIO 1 2025-05-07T20:26:26.3246253Z #define cudaKernelNodeAttrID cudaLaunchAttributeID 2025-05-07T20:26:26.3246622Z #define __glibcxx_class_requires2(_a,_b,_c) 2025-05-07T20:26:26.3246920Z #define __USE_ISOC11 1 2025-05-07T20:26:26.3247156Z #define _BSD_SIZE_T_ 2025-05-07T20:26:26.3247398Z #define ADJ_MICRO 0x1000 2025-05-07T20:26:26.3247647Z #define _GLIBCXX_HAVE_FABSF 1 2025-05-07T20:26:26.3247918Z #define _GLIBCXX_HAVE_FABSL 1 2025-05-07T20:26:26.3248222Z #define _PSTL_PRAGMA_SIMD _PSTL_PRAGMA(omp simd) 2025-05-07T20:26:26.3248754Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:26:26.3249072Z #define __attribute_const__ __attribute__ ((__const__)) 2025-05-07T20:26:26.3249405Z #define __THROW throw () 2025-05-07T20:26:26.3249768Z #define __cudaGet_gridDim() gridDim 2025-05-07T20:26:26.3250060Z #define __SM_60_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:26.3250420Z #define __glibcxx_requires_heap_pred(_First,_Last,_Pred) 2025-05-07T20:26:26.3250778Z #define htobe32(x) __bswap_32 (x) 2025-05-07T20:26:26.3251056Z #define _GLIBCXX_HAVE_POWL 1 2025-05-07T20:26:26.3251314Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:26:26.3251583Z #define __GLIBC_HAVE_LONG_LONG 1 2025-05-07T20:26:26.3251841Z #define L_tmpnam 20 2025-05-07T20:26:26.3252071Z #define ___int_wchar_t_h 2025-05-07T20:26:26.3252421Z #define WIFCONTINUED(status) __WIFCONTINUED (__WAIT_INT (status)) 2025-05-07T20:26:26.3252802Z #define isascii(c) __isascii (c) 2025-05-07T20:26:26.3253071Z #define _T_PTRDIFF 2025-05-07T20:26:26.3253390Z #define _GLIBCXX_MOVE3(_Tp,_Up,_Vp) std::move(_Tp, _Up, _Vp) 2025-05-07T20:26:26.3253779Z #define toascii(c) __toascii (c) 2025-05-07T20:26:26.3254058Z #define __GNUC__ 11 2025-05-07T20:26:26.3254318Z #define __SYSCALL_ULONG_TYPE __ULONGWORD_TYPE 2025-05-07T20:26:26.3254622Z #define __GXX_RTTI 1 2025-05-07T20:26:26.3254850Z #define __pie__ 2 2025-05-07T20:26:26.3255070Z #define __MMX__ 1 2025-05-07T20:26:26.3255291Z #define __cudaCDP2Malloc 2025-05-07T20:26:26.3255916Z #define __timespec_defined 1 2025-05-07T20:26:26.3256292Z #define L_ctermid 9 2025-05-07T20:26:26.3256618Z #define __OFF64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:26.3257028Z #define __cudaCDP2GetParameterBufferV2 2025-05-07T20:26:26.3257543Z #define offsetof(TYPE,MEMBER) __builtin_offsetof (TYPE, MEMBER) 2025-05-07T20:26:26.3257927Z #define _BITS_POSIX2_LIM_H 1 2025-05-07T20:26:26.3258262Z #define _GLIBCXX98_USE_C99_STDLIB 1 2025-05-07T20:26:26.3258563Z #define cudaMemAttachGlobal 0x01 2025-05-07T20:26:26.3258878Z #define FD_SET(fd,fdsetp) __FD_SET (fd, fdsetp) 2025-05-07T20:26:26.3259199Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:26:26.3259471Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:26:26.3259921Z #define _GLIBCXX_NATIVE_THREAD_ID (__gthread_active_p() ? __gthread_self() : (__gthread_t)1) 2025-05-07T20:26:26.3260688Z #define assert_perror(errnum) (!(errnum) ? 
__ASSERT_VOID_CAST (0) : __assert_perror_fail ((errnum), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:26.3261290Z #define _IO_HAVE_ST_BLKSIZE _G_HAVE_ST_BLKSIZE 2025-05-07T20:26:26.3261605Z #define __USE_SVID 1 2025-05-07T20:26:26.3261866Z #define __constant__ __location__(constant) 2025-05-07T20:26:26.3262178Z #define _GLIBCXX_HAVE_POSIX_MEMALIGN 1 2025-05-07T20:26:26.3262483Z #define __device__ __location__(device) 2025-05-07T20:26:26.3262815Z #define _GLIBCXX_HAVE_EXCEPTION_PTR_SINCE_GCC46 1 2025-05-07T20:26:26.3263141Z #define _GLIBCXX_RES_LIMITS 1 2025-05-07T20:26:26.3263411Z #define M_1_PI 0.31830988618379067154 2025-05-07T20:26:26.3263732Z #define CUDART_DEVICE __device__ 2025-05-07T20:26:26.3264110Z #define __LDBL_REDIR1_NTH(name,proto,alias) name proto __THROW 2025-05-07T20:26:26.3264486Z #define M_PI_2 1.57079632679489661923 2025-05-07T20:26:26.3264772Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:26:26.3265153Z #define cudaExternalSemaphoreWaitSkipNvSciBufMemSync 0x02 2025-05-07T20:26:26.3265540Z #define __STDC_UTF_16__ 1 2025-05-07T20:26:26.3265797Z #define LONG_MAX __LONG_MAX__ 2025-05-07T20:26:26.3266178Z #define __glibcxx_digits10_b(T,B) (__glibcxx_digits_b (T,B) * 643L / 2136) 2025-05-07T20:26:26.3266603Z #define _POSIX_THREAD_DESTRUCTOR_ITERATIONS 4 2025-05-07T20:26:26.3266929Z #define _POSIX_HOST_NAME_MAX 255 2025-05-07T20:26:26.3267208Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:26:26.3267473Z #define NGROUPS_MAX 65536 2025-05-07T20:26:26.3267735Z #define _GLIBCXX_NAMESPACE_LDBL 2025-05-07T20:26:26.3268003Z #define __USE_ISOC95 1 2025-05-07T20:26:26.3268226Z #define _TIME_H 1 2025-05-07T20:26:26.3268502Z #define M_LOG10El 0.434294481903251827651128918916605082L 2025-05-07T20:26:26.3269115Z #define __USE_ISOC99 1 2025-05-07T20:26:26.3269446Z #define __ASMNAME(cname) __ASMNAME2 (__USER_LABEL_PREFIX__, cname) 2025-05-07T20:26:26.3269827Z #define HOST_NAME_MAX 64 2025-05-07T20:26:26.3270277Z #define _POSIX_SEM_NSEMS_MAX 256 2025-05-07T20:26:26.3270546Z #define _IOS_ATEND 4 2025-05-07T20:26:26.3270779Z #define __SM_35_INTRINSICS_H__ 2025-05-07T20:26:26.3271117Z #define WTERMSIG(status) __WTERMSIG (__WAIT_INT (status)) 2025-05-07T20:26:26.3271527Z #define cudaStreamAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:26.3271871Z #define _GLIBCXX_HAVE_S_ISREG 1 2025-05-07T20:26:26.3272162Z #define cudaSurfaceTypeCubemap 0x0C 2025-05-07T20:26:26.3272486Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:26:26.3272800Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:26:26.3273061Z #define _STDIO_H 1 2025-05-07T20:26:26.3273473Z #define __isctype_l(c,type,locale) ((locale)->__ctype_b[(int) (c)] & (unsigned short int) type) 2025-05-07T20:26:26.3274115Z #define _GLIBCXX_PREDEFINED_OPS_H 1 2025-05-07T20:26:26.3274530Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:26.3274912Z #define _G_IO_IO_FILE_VERSION 0x20001 2025-05-07T20:26:26.3275218Z #define _POSIX_SIGQUEUE_MAX 32 2025-05-07T20:26:26.3275486Z #define _GLIBCXX_HAVE_GETS 1 2025-05-07T20:26:26.3275767Z #define _GLIBCXX_HAVE_LINUX_TYPES_H 1 2025-05-07T20:26:26.3276063Z #define __cpp_raw_strings 200710L 2025-05-07T20:26:26.3276363Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:26.3276689Z #define _GLIBCXX_HAVE_VFWSCANF 1 2025-05-07T20:26:26.3276967Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:26:26.3277246Z #define __STDCPP_MATH_SPEC_FUNCS__ 201003L 2025-05-07T20:26:26.3277556Z #define _GLIBCXX_STDIO_EOF -1 2025-05-07T20:26:26.3277834Z #define 
__SIZEOF_PTHREAD_MUTEX_T 40 2025-05-07T20:26:26.3278120Z #define __CHANNEL_DESCRIPTOR_H__ 2025-05-07T20:26:26.3278479Z #define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8)) 2025-05-07T20:26:26.3278859Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:26:26.3279110Z #define __USE_XOPEN 1 2025-05-07T20:26:26.3279359Z #define __SIZEOF_PTHREAD_RWLOCK_T 56 2025-05-07T20:26:26.3279807Z #define cudaStreamAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:26.3280257Z #define __USE_XOPEN2K 1 2025-05-07T20:26:26.3280501Z #define _PSTL_UDR_PRESENT 1 2025-05-07T20:26:26.3280778Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:26:26.3281084Z #define _GLIBCXX_HAVE_COSF 1 2025-05-07T20:26:26.3281360Z #define __cpp_fold_expressions 201603L 2025-05-07T20:26:26.3281896Z #define cudaWaitExternalSemaphoresAsync __CUDART_API_PTSZ(cudaWaitExternalSemaphoresAsync_v2) 2025-05-07T20:26:26.3282429Z #define NL_LANGMAX _POSIX2_LINE_MAX 2025-05-07T20:26:26.3282716Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:26:26.3283081Z #define __glibcxx_requires_partitioned_upper(_First,_Last,_Value) 2025-05-07T20:26:26.3283479Z #define __DADDR_T_TYPE __S32_TYPE 2025-05-07T20:26:26.3283876Z #define cudaExternalSemaphoreSignalSkipNvSciBufMemSync 0x01 2025-05-07T20:26:26.3284272Z #define __END_NAMESPACE_C99 2025-05-07T20:26:26.3284555Z #define __glibcxx_integral_traps true 2025-05-07T20:26:26.3284945Z #define _POSIX_PATH_MAX 256 2025-05-07T20:26:26.3285297Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:26:26.3285657Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:26:26.3286024Z #define _ISOC11_SOURCE 1 2025-05-07T20:26:26.3286373Z #define _GLIBCXX_HAVE_LINUX_FUTEX 1 2025-05-07T20:26:26.3286672Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:26:26.3286981Z #define _GLIBCXX_HAVE_QUICK_EXIT 1 2025-05-07T20:26:26.3287345Z #define __glibcxx_requires_irreflexive_pred2(_First,_Last,_Pred) 2025-05-07T20:26:26.3287737Z #define LONG_MIN (-LONG_MAX - 1L) 2025-05-07T20:26:26.3288018Z #define _GLIBCXX_HAVE_SINCOSF 1 2025-05-07T20:26:26.3288285Z #define _IO_UNITBUF 020000 2025-05-07T20:26:26.3288538Z #define _GLIBCXX_HAVE_SINCOSL 1 2025-05-07T20:26:26.3288803Z #define __FD_SETSIZE 1024 2025-05-07T20:26:26.3289198Z #define getc(_fp) _IO_getc (_fp) 2025-05-07T20:26:26.3289474Z #define be32toh(x) __bswap_32 (x) 2025-05-07T20:26:26.3289823Z #define _GLIBCXX_PACKAGE__GLIBCXX_VERSION "version-unused" 2025-05-07T20:26:26.3290185Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:26:26.3290539Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:26:26.3290855Z #define isxdigit_l(c,l) __isxdigit_l ((c), (l)) 2025-05-07T20:26:26.3291182Z #define _GLIBCXX_HAVE_GETIPINFO 1 2025-05-07T20:26:26.3291457Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:26:26.3291766Z #define __isalnum_l(c,l) __isctype_l((c), _ISalnum, (l)) 2025-05-07T20:26:26.3292107Z #define _WCHAR_T_DEFINED_ 2025-05-07T20:26:26.3292394Z #define cudaIpcMemLazyEnablePeerAccess 0x01 2025-05-07T20:26:26.3292726Z #define _GLIBCXX_HAVE_AT_QUICK_EXIT 1 2025-05-07T20:26:26.3293027Z #define __INO_T_MATCHES_INO64_T 1 2025-05-07T20:26:26.3293304Z #define __USE_POSIX199506 1 2025-05-07T20:26:26.3293557Z #define _FEATURES_H 1 2025-05-07T20:26:26.3293831Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:26:26.3294257Z #define _PSTL_PRAGMA_SIMD_REDUCTION(PRM) _PSTL_PRAGMA(omp simd reduction(PRM)) 2025-05-07T20:26:26.3294673Z #define __stub_getmsg 2025-05-07T20:26:26.3294911Z #define _IO_FIXED 010000 2025-05-07T20:26:26.3295194Z #define __cpp_lib_addressof_constexpr 201603 
2025-05-07T20:26:26.3295508Z #define _GLIBCXX11_USE_C99_STDIO 1 2025-05-07T20:26:26.3295787Z #define __stub_setlogin 2025-05-07T20:26:26.3296030Z #define __stub_fattach 2025-05-07T20:26:26.3296273Z #define __cplusplus 201703L 2025-05-07T20:26:26.3296548Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:26:26.3296834Z #define _STRUCT_TIMEVAL 1 2025-05-07T20:26:26.3297091Z #define INFINITY (__builtin_inff()) 2025-05-07T20:26:26.3297376Z #define _IO_UNBUFFERED 2 2025-05-07T20:26:26.3297875Z #define cudaStreamAttributeSynchronizationPolicy cudaLaunchAttributeSynchronizationPolicy 2025-05-07T20:26:26.3298578Z #define _IO_INTERNAL 010 2025-05-07T20:26:26.3298827Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:26:26.3299176Z #define cudaKernelNodeAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:26.3299540Z #define __dev_t_defined 2025-05-07T20:26:26.3299781Z #define __DEPRECATED 1 2025-05-07T20:26:26.3300018Z #define __S32_TYPE int 2025-05-07T20:26:26.3300280Z #define __cpp_rvalue_references 200610L 2025-05-07T20:26:26.3300576Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:26:26.3300840Z #define _IO_fpos_t _G_fpos_t 2025-05-07T20:26:26.3301100Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:26:26.3301709Z #define cudaKernelNodeAttributePreferredSharedMemoryCarveout cudaLaunchAttributePreferredSharedMemoryCarveout 2025-05-07T20:26:26.3302352Z #define _G_HAVE_MREMAP 1 2025-05-07T20:26:26.3302667Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:26.3303013Z #define OVERFLOW 3 2025-05-07T20:26:26.3303260Z #define __toascii_l(c,l) ((l), __toascii (c)) 2025-05-07T20:26:26.3303573Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:26:26.3303876Z #define __SM_32_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:26.3304256Z #define _GLIBCXX_DEFAULT_ABI_TAG _GLIBCXX_ABI_TAG_CXX11 2025-05-07T20:26:26.3304593Z #define __SSE2_MATH__ 1 2025-05-07T20:26:26.3304845Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:26:26.3305154Z #define __FSFILCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:26.3305465Z #define _IO_STDIO_H 2025-05-07T20:26:26.3305719Z #define PDP_ENDIAN __PDP_ENDIAN 2025-05-07T20:26:26.3306013Z #define isspace_l(c,l) __isspace_l ((c), (l)) 2025-05-07T20:26:26.3306343Z #define __cudaCDP2Memcpy2DAsync 2025-05-07T20:26:26.3306647Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:26.3306963Z #define _GLIBCXX_HAVE_STRERROR_R 1 2025-05-07T20:26:26.3307230Z #define __amd64 1 2025-05-07T20:26:26.3307457Z #define _POSIX_TZNAME_MAX 6 2025-05-07T20:26:26.3307725Z #define __cudaCDP2Memset3DAsync 2025-05-07T20:26:26.3308025Z #define __SYSCALL_WORDSIZE 64 2025-05-07T20:26:26.3308322Z #define _GLIBCXX_HAVE_ATTRIBUTE_VISIBILITY 1 2025-05-07T20:26:26.3308639Z #define _EXT_TYPE_TRAITS 1 2025-05-07T20:26:26.3308908Z #define _GLIBCXX_HAVE_POSIX_SEMAPHORE 1 2025-05-07T20:26:26.3309434Z #define _POSIX_RE_DUP_MAX 255 2025-05-07T20:26:26.3309715Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:26:26.3309963Z #define __bounded 2025-05-07T20:26:26.3310207Z #define __USECONDS_T_TYPE __U32_TYPE 2025-05-07T20:26:26.3310584Z #define _IO_DELETE_DONT_CLOSE 0x40 2025-05-07T20:26:26.3310869Z #define __BEGIN_NAMESPACE_STD 2025-05-07T20:26:26.3311143Z #define _PTRDIFF_T_DECLARED 2025-05-07T20:26:26.3311425Z #define __OFF_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:26.3311745Z #define __W_STOPCODE(sig) ((sig) << 8 | 0x7f) 2025-05-07T20:26:26.3312169Z #define cudaStreamAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:26.3312577Z #define _GLIBCXX_HAVE_NETDB_H 1 2025-05-07T20:26:26.3312854Z 
#define __SM_20_INTRINSICS_HPP__ 2025-05-07T20:26:26.3313196Z #define __cpp_lib_has_unique_object_representations 201606 2025-05-07T20:26:26.3313559Z #define STA_PLL 0x0001 2025-05-07T20:26:26.3313852Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:26:26.3314128Z #define __GNUG__ 11 2025-05-07T20:26:26.3314369Z #define _GLIBCXX_USE_GET_NPROCS 1 2025-05-07T20:26:26.3314635Z #define _T_WCHAR 2025-05-07T20:26:26.3314872Z #define __cudaCDP2GetDeviceCount 2025-05-07T20:26:26.3315170Z #define __specialization_static 2025-05-07T20:26:26.3315477Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:26:26.3315790Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:26:26.3316058Z #define cudaArraySparse 0x40 2025-05-07T20:26:26.3316328Z #define STA_PPSFREQ 0x0002 2025-05-07T20:26:26.3316577Z #define __GLIBCXX__ 20230528 2025-05-07T20:26:26.3316867Z #define _IO_stdin ((_IO_FILE*)(&_IO_2_1_stdin_)) 2025-05-07T20:26:26.3317172Z #define _WCHAR_T 2025-05-07T20:26:26.3317391Z #define __cudaCDP2Free 2025-05-07T20:26:26.3318043Z #define __FD_ZERO(fdsp) do { int __d0, __d1; __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS : "=c" (__d0), "=D" (__d1) : "a" (0), "0" (sizeof (fd_set) / sizeof (__fd_mask)), "1" (&__FDS_BITS (fdsp)[0]) : "memory"); } while (0) 2025-05-07T20:26:26.3318757Z #define __cpp_nsdmi 200809L 2025-05-07T20:26:26.3319183Z #define __glibcxx_min_b(T,B) (__glibcxx_signed_b (T,B) ? -__glibcxx_max_b (T,B) - 1 : (T)0) 2025-05-07T20:26:26.3319625Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:26:26.3319914Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:26:26.3320180Z #define cudaArrayCubemap 0x04 2025-05-07T20:26:26.3320513Z #define _PSTL_MONOTONIC_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:26.3320867Z #define _GLIBCXX_UTILITY 1 2025-05-07T20:26:26.3321118Z #define __NO_CTYPE 1 2025-05-07T20:26:26.3321354Z #define __stub_bdflush 2025-05-07T20:26:26.3321727Z #define _GLIBCXX_MAKE_MOVE_ITERATOR(_Iter) std::make_move_iterator(_Iter) 2025-05-07T20:26:26.3322157Z #define __CORRECT_ISO_CPP_STRING_H_PROTO 2025-05-07T20:26:26.3322468Z #define _GLIBCXX_STDC_HEADERS 1 2025-05-07T20:26:26.3322738Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:26:26.3323022Z #define __cpp_initializer_lists 200806L 2025-05-07T20:26:26.3323336Z #define _GLIBCXX_HAVE_NETINET_TCP_H 1 2025-05-07T20:26:26.3323654Z #define __U16_TYPE unsigned short int 2025-05-07T20:26:26.3324035Z #define __glibcxx_requires_can_increment(_First,_Size) 2025-05-07T20:26:26.3324389Z #define _GLIBCXX_HAVE_SYS_PARAM_H 1 2025-05-07T20:26:26.3324676Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:26:26.3324964Z #define cudaHostRegisterIoMemory 0x04 2025-05-07T20:26:26.3325312Z #define __FD_MASK(d) ((__fd_mask) 1 << ((d) % __NFDBITS)) 2025-05-07T20:26:26.3325660Z #define __cpp_lib_is_invocable 201703 2025-05-07T20:26:26.3325943Z #define _IO_STDIO 040000 2025-05-07T20:26:26.3326282Z #define _SIGSET_NWORDS (1024 / (8 * sizeof (unsigned long int))) 2025-05-07T20:26:26.3326677Z #define cudaSurfaceType1DLayered 0xF1 2025-05-07T20:26:26.3326997Z #define cudaArraySurfaceLoadStore 0x02 2025-05-07T20:26:26.3327297Z #define _PTRDIFF_T 2025-05-07T20:26:26.3327521Z #define _MOVE_H 1 2025-05-07T20:26:26.3327747Z #define __cpp_hex_float 201603L 2025-05-07T20:26:26.3328019Z #define ADJ_TAI 0x0080 2025-05-07T20:26:26.3328252Z #define __ptrvalue 2025-05-07T20:26:26.3328581Z #define _GLIBCXX_HOSTED 1 2025-05-07T20:26:26.3328839Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:26:26.3329129Z #define __WTERMSIG(status) ((status) & 0x7f) 2025-05-07T20:26:26.3329512Z #define 
MATH_ERREXCEPT 2 2025-05-07T20:26:26.3329773Z #define _GLIBCXX_HAS_GTHREADS 1 2025-05-07T20:26:26.3330060Z #define cudaTextureType2DLayered 0xF2 2025-05-07T20:26:26.3330462Z #define __isleap(year) ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0)) 2025-05-07T20:26:26.3330838Z #define __USE_GNU 1 2025-05-07T20:26:26.3331075Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:26:26.3331356Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:26:26.3331622Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:26:26.3332014Z #define __FD_CLR(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] &= ~__FD_MASK (d))) 2025-05-07T20:26:26.3332410Z #define WEXITED 4 2025-05-07T20:26:26.3332625Z #define _IO_NO_READS 4 2025-05-07T20:26:26.3332929Z #define cudaGraphKernelNodePortLaunchCompletion 2 2025-05-07T20:26:26.3333289Z #define M_LOG2E 1.4426950408889634074 2025-05-07T20:26:26.3333569Z #define _POSIX_SYMLINK_MAX 255 2025-05-07T20:26:26.3333904Z #define _GLIBCXX_HAVE_BUILTIN_HAS_UNIQ_OBJ_REP 1 2025-05-07T20:26:26.3334255Z #define __uid_t_defined 2025-05-07T20:26:26.3334542Z #define __FD_ELT(d) ((d) / __NFDBITS) 2025-05-07T20:26:26.3334938Z #define _GLIBCXX_USE_STD_SPEC_FUNCS 1 2025-05-07T20:26:26.3335275Z #define WNOHANG 1 2025-05-07T20:26:26.3335527Z #define alloca(size) __builtin_alloca (size) 2025-05-07T20:26:26.3335836Z #define _GLIBCXX_HAVE_HYPOTF 1 2025-05-07T20:26:26.3336120Z #define cudaEventDefault 0x00 2025-05-07T20:26:26.3336434Z #define __maxnreg__(a) __attribute__((maxnreg(a))) 2025-05-07T20:26:26.3336752Z #define NL_SETMAX INT_MAX 2025-05-07T20:26:26.3336993Z #define __x86_64 1 2025-05-07T20:26:26.3337231Z #define __cudaCDP2LaunchDevice 2025-05-07T20:26:26.3337625Z #define __REDIRECT(name,proto,alias) name proto __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:26.3338220Z #define _GLIBCXX_BEGIN_NAMESPACE_CXX11 namespace __cxx11 { 2025-05-07T20:26:26.3338738Z #define __extern_always_inline extern __always_inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:26.3339182Z #define __PTRDIFF_T 2025-05-07T20:26:26.3339514Z #define __exctype_l(name) extern int name (int, __locale_t) __THROW 2025-05-07T20:26:26.3339899Z #define _GLIBCXX_HAVE_FINITEL 1 2025-05-07T20:26:26.3340180Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:26.3340474Z #define _Mlong_double_ long double 2025-05-07T20:26:26.3340764Z #define __cpp_lambdas 200907L 2025-05-07T20:26:26.3341027Z #define _IO_DEC 020 2025-05-07T20:26:26.3341256Z #define _GLIBCXX_HAVE_SINHL 1 2025-05-07T20:26:26.3341536Z #define _POSIX_CLOCKRES_MIN 20000000 2025-05-07T20:26:26.3341831Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:26:26.3342114Z #define ADJ_TIMECONST 0x0020 2025-05-07T20:26:26.3342381Z #define _GLIBCXX_HAVE_SQRTL 1 2025-05-07T20:26:26.3342687Z #define __cudaCDP2DeviceGetSharedMemConfig 2025-05-07T20:26:26.3343025Z #define _GLIBCXX_HAVE_STDALIGN_H 1 2025-05-07T20:26:26.3343312Z #define _ANSI_STDDEF_H 2025-05-07T20:26:26.3343601Z #define _GLIBCXX_MOVE(__val) std::move(__val) 2025-05-07T20:26:26.3343969Z #define _GLIBCXX_HAVE_STRERROR_L 1 2025-05-07T20:26:26.3344349Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:26:26.3344737Z #define _GLIBCXX_USE_DEV_RANDOM 1 2025-05-07T20:26:26.3345027Z #define _STL_ITERATOR_BASE_TYPES_H 1 2025-05-07T20:26:26.3345320Z #define __cpp_template_auto 201606L 2025-05-07T20:26:26.3345683Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:26:26.3346056Z #define _GLIBCXX_HAVE_SYS_SEM_H 1 2025-05-07T20:26:26.3346323Z #define 
__key_t_defined 2025-05-07T20:26:26.3346577Z #define _IO_MAGIC_MASK 0xFFFF0000 2025-05-07T20:26:26.3346954Z #define __cluster_dims__(...) __attribute__((cluster_dims(__VA_ARGS__))) 2025-05-07T20:26:26.3347425Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:26:26.3347917Z #define __GNUC_VA_LIST 2025-05-07T20:26:26.3348260Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:26.3348649Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:26:26.3349005Z #define CLOCK_REALTIME_COARSE 5 2025-05-07T20:26:26.3349291Z #define _GLIBCXX14_CONSTEXPR constexpr 2025-05-07T20:26:26.3349598Z #define __USE_XOPEN2KXSI 1 2025-05-07T20:26:26.3349852Z #define __WCOREFLAG 0x80 2025-05-07T20:26:26.3350109Z #define M_2_SQRTPI 1.12837916709551257390 2025-05-07T20:26:26.3350419Z #define cudaEventDisableTiming 0x02 2025-05-07T20:26:26.3350697Z #define __LP64__ 1 2025-05-07T20:26:26.3350947Z #define __isascii_l(c,l) ((l), __isascii (c)) 2025-05-07T20:26:26.3351270Z #define cudaStreamNonBlocking 0x01 2025-05-07T20:26:26.3351553Z #define _IO_off64_t __off64_t 2025-05-07T20:26:26.3351822Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:26.3352091Z #define __time_t_defined 1 2025-05-07T20:26:26.3352354Z #define _POSIX_SYMLOOP_MAX 8 2025-05-07T20:26:26.3352707Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:26:26.3353078Z #define __USE_UNIX98 1 2025-05-07T20:26:26.3353326Z #define __MODE_T_TYPE __U32_TYPE 2025-05-07T20:26:26.3353609Z #define CLOCK_REALTIME_ALARM 8 2025-05-07T20:26:26.3353933Z #define _GLIBCXX_HAVE_STRINGS_H 1 2025-05-07T20:26:26.3354237Z #define __LEAF_ATTR __attribute__ ((__leaf__)) 2025-05-07T20:26:26.3354552Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:26:26.3354817Z #define SEEK_CUR 1 2025-05-07T20:26:26.3355054Z #define __RLIM64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:26.3355332Z #define _ASSERT_H 1 2025-05-07T20:26:26.3356433Z #define _PSTL_PRAGMA_DECLARE_REDUCTION(NAME,OP) _PSTL_PRAGMA(omp declare reduction(NAME:OP : omp_out(omp_in)) initializer(omp_priv = omp_orig)) 2025-05-07T20:26:26.3357086Z #define _GLIBCXX_USE_DEPRECATED 1 2025-05-07T20:26:26.3357372Z #define CHAR_MAX SCHAR_MAX 2025-05-07T20:26:26.3357630Z #define _GLIBCXX_HAVE_SETENV 1 2025-05-07T20:26:26.3357904Z #define NL_ARGMAX _POSIX_ARG_MAX 2025-05-07T20:26:26.3358188Z #define _GLIBCXX_USE_UTIMENSAT 1 2025-05-07T20:26:26.3358568Z #define __extern_inline extern __inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:26.3358986Z #define _GLIBCXX_DEBUG_ONLY(_Statement) 2025-05-07T20:26:26.3359659Z #define _IO_putc_unlocked(_ch,_fp) (_IO_BE ((_fp)->_IO_write_ptr >= (_fp)->_IO_write_end, 0) ? 
__overflow (_fp, (unsigned char) (_ch)) : (unsigned char) (*(_fp)->_IO_write_ptr++ = (_ch))) 2025-05-07T20:26:26.3360324Z #define _GLIBCXX_HAVE_BUILTIN_LAUNDER 1 2025-05-07T20:26:26.3360627Z #define _IO_BOOLALPHA 0200000 2025-05-07T20:26:26.3360988Z #define _PSTL_CPP17_EXECUTION_POLICIES_PRESENT (_MSC_VER >= 1912) 2025-05-07T20:26:26.3361372Z #define _GLIBCXX_PACKAGE_URL "" 2025-05-07T20:26:26.3361643Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:26:26.3361932Z #define cudaArrayDefault 0x00 2025-05-07T20:26:26.3362221Z #define __cudaCDP2LaunchDeviceV2 2025-05-07T20:26:26.3362517Z #define __FDS_BITS(set) ((set)->fds_bits) 2025-05-07T20:26:26.3362806Z #define TLOSS 5 2025-05-07T20:26:26.3363031Z #define __ssize_t_defined 2025-05-07T20:26:26.3363282Z #define __CUDACC_VER_BUILD__ 85 2025-05-07T20:26:26.3363563Z #define _GLIBCXX_HAVE_SYS_SOCKET_H 1 2025-05-07T20:26:26.3363869Z #define ULONG_MAX (LONG_MAX * 2UL + 1UL) 2025-05-07T20:26:26.3364163Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:26:26.3364536Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11 2025-05-07T20:26:26.3364929Z #define _POSIX_HIWAT _POSIX_PIPE_BUF 2025-05-07T20:26:26.3365223Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:26:26.3365512Z #define __cudaCDP2EventRecordWithFlags 2025-05-07T20:26:26.3365831Z #define _GLIBCXX_ATOMIC_BUILTINS 1 2025-05-07T20:26:26.3366138Z #define cudaPeerAccessDefault 0x00 2025-05-07T20:26:26.3366425Z #define __REGISTER_PREFIX__ 2025-05-07T20:26:26.3366687Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:26:26.3367026Z #define __glibcxx_requires_sorted_set(_First1,_Last1,_First2) 2025-05-07T20:26:26.3367386Z #define _IOS_NOREPLACE 64 2025-05-07T20:26:26.3367884Z #define __cdecl 2025-05-07T20:26:26.3368132Z #define cudaEventInterprocess 0x04 2025-05-07T20:26:26.3368461Z #define M_SQRT1_2l 0.707106781186547524400844362104849039L 2025-05-07T20:26:26.3368796Z #define LOGIN_NAME_MAX 256 2025-05-07T20:26:26.3369235Z #define _IO_TIED_PUT_GET 0x400 2025-05-07T20:26:26.3369512Z #define X_TLOSS 1.41484755040568800000e+16 2025-05-07T20:26:26.3369802Z #define CUDA_IPC_HANDLE_SIZE 64 2025-05-07T20:26:26.3370092Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:26:26.3370408Z #define __attribute_pure__ __attribute__ ((__pure__)) 2025-05-07T20:26:26.3370740Z #define __TEXTURE_TYPES_H__ 2025-05-07T20:26:26.3378634Z #define __NV_GLIBCXX_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:26.3379097Z #define ADJ_NANO 0x2000 2025-05-07T20:26:26.3379419Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:26:26.3379778Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:26:26.3380080Z #define _GLIBCXX_HAVE_ISWBLANK 1 2025-05-07T20:26:26.3380361Z #define __FLT_DIG__ 6 2025-05-07T20:26:26.3380721Z #define __REDIRECT_LDBL(name,proto,alias) __REDIRECT (name, proto, alias) 2025-05-07T20:26:26.3381130Z #define __NO_INLINE__ 1 2025-05-07T20:26:26.3381451Z #define _PSTL_EARLYEXIT_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:26.3381809Z #define _POSIX_NGROUPS_MAX 8 2025-05-07T20:26:26.3382080Z #define ADJ_STATUS 0x0010 2025-05-07T20:26:26.3382355Z #define __cudaCDP2MemcpyAsync_ptsz 2025-05-07T20:26:26.3382653Z #define CLOCK_BOOTTIME_ALARM 9 2025-05-07T20:26:26.3382935Z #define LONG_LONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:26.3383241Z #define _GLIBCXX_HAVE_OBSOLETE_ISNAN 1 2025-05-07T20:26:26.3383547Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:26:26.3383992Z #define cudaStreamGraphFireAndForget (cudaStream_t)0x0200000000000000 
2025-05-07T20:26:26.3384419Z #define _GLIBCXX_HAVE_ALIGNED_ALLOC 1 2025-05-07T20:26:26.3384769Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:26:26.3385127Z #define CHAR_MIN SCHAR_MIN 2025-05-07T20:26:26.3385382Z #define MAX_CANON 255 2025-05-07T20:26:26.3385620Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:26:26.3385891Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:26:26.3386173Z #define _GLIBCXX_HAVE_COMPLEX_H 1 2025-05-07T20:26:26.3386469Z #define _PSTL_PRAGMA_VECTOR_UNALIGNED 2025-05-07T20:26:26.3386787Z #define _POSIX_FD_SETSIZE _POSIX_OPEN_MAX 2025-05-07T20:26:26.3387102Z #define _GLIBCXX_HAVE_HYPOT 1 2025-05-07T20:26:26.3387388Z #define __cudaCDP2Memset2DAsync_ptsz 2025-05-07T20:26:26.3387722Z #define _GLIBCXX_TR1_MODIFIED_BESSEL_FUNC_TCC 1 2025-05-07T20:26:26.3388096Z #define __VERSION__ "11.4.0" 2025-05-07T20:26:26.3388462Z #define _GLIBCXX11_USE_C99_STDLIB 1 2025-05-07T20:26:26.3388766Z #define cudaHostRegisterMapped 0x02 2025-05-07T20:26:26.3389158Z #define _GLIBCXX_HAVE_INT64_T 1 2025-05-07T20:26:26.3389451Z #define _GLIBCXX_USE_CONSTEXPR constexpr 2025-05-07T20:26:26.3389768Z #define FD_ZERO(fdsetp) __FD_ZERO (fdsetp) 2025-05-07T20:26:26.3390080Z #define __UINT64_C(c) c ## UL 2025-05-07T20:26:26.3390350Z #define MOD_OFFSET ADJ_OFFSET 2025-05-07T20:26:26.3390605Z #define _SYS_TYPES_H 1 2025-05-07T20:26:26.3390858Z #define AIO_PRIO_DELTA_MAX 20 2025-05-07T20:26:26.3391134Z #define _GLIBCXX_HAVE_TANHF 1 2025-05-07T20:26:26.3391388Z #define _SYS_CDEFS_H 1 2025-05-07T20:26:26.3391636Z #define _GLIBCXX_HAVE_TANHL 1 2025-05-07T20:26:26.3391920Z #define __cpp_unicode_characters 201411L 2025-05-07T20:26:26.3392218Z #define _IO_ERR_SEEN 0x20 2025-05-07T20:26:26.3392487Z #define _GLIBCXX_USE_DECIMAL_FLOAT 1 2025-05-07T20:26:26.3392795Z #define __cudaCDP2StreamDestroy 2025-05-07T20:26:26.3393067Z #define FP_SUBNORMAL 3 2025-05-07T20:26:26.3393325Z #define cudaOccupancyDefault 0x00 2025-05-07T20:26:26.3393614Z #define _INITIALIZER_LIST 2025-05-07T20:26:26.3393897Z #define _STDC_PREDEF_H 1 2025-05-07T20:26:26.3394175Z #define __CUDA_RUNTIME_API_H__ 2025-05-07T20:26:26.3394461Z #define _GLIBCXX_PACKAGE_BUGREPORT "" 2025-05-07T20:26:26.3394758Z #define _GLIBCXX_HAVE_MODF 1 2025-05-07T20:26:26.3395182Z #define _IO_file_flags _flags 2025-05-07T20:26:26.3395454Z #define __USE_XOPEN2K8 1 2025-05-07T20:26:26.3395712Z #define htobe64(x) __bswap_64 (x) 2025-05-07T20:26:26.3396078Z #define _OLD_STDIO_MAGIC 0xFABC0000 2025-05-07T20:26:26.3396362Z #define HUGE 3.40282347e+38F 2025-05-07T20:26:26.3396638Z #define __cpp_lib_is_null_pointer 201309 2025-05-07T20:26:26.3397015Z #define WEXITSTATUS(status) __WEXITSTATUS (__WAIT_INT (status)) 2025-05-07T20:26:26.3397422Z #define islower_l(c,l) __islower_l ((c), (l)) 2025-05-07T20:26:26.3397737Z #define _GLIBCXX_USE_CXX11_ABI 1 2025-05-07T20:26:26.3398010Z #define _GLIBCXX_HAVE_SYMLINK 1 2025-05-07T20:26:26.3398271Z #define _BSD_SOURCE 1 2025-05-07T20:26:26.3398516Z #define _GLIBCXX_THROW(_EXC) 2025-05-07T20:26:26.3399382Z #define _GLIBCXX_HAS_NESTED_TYPE(_NTYPE) template> struct __has_ ##_NTYPE : false_type { }; template struct __has_ ##_NTYPE<_Tp, __void_t> : true_type { }; 2025-05-07T20:26:26.3400253Z #define __catch(X) catch(X) 2025-05-07T20:26:26.3400527Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:26:26.3400825Z #define LINE_MAX _POSIX2_LINE_MAX 2025-05-07T20:26:26.3401107Z #define __TIMER_T_TYPE void * 2025-05-07T20:26:26.3401366Z #define __STRING(x) #x 2025-05-07T20:26:26.3401615Z #define 
__GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:26:26.3401887Z #define _T_PTRDIFF_ 2025-05-07T20:26:26.3402140Z #define _GLIBCXX_USE_NOEXCEPT noexcept 2025-05-07T20:26:26.3402451Z #define cudaEventWaitExternal 0x01 2025-05-07T20:26:26.3402729Z #define __unbounded 2025-05-07T20:26:26.3402977Z #define __DEVICE_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:26.3403272Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:26:26.3403558Z #define __INO_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:26.3403859Z #define be16toh(x) __bswap_16 (x) 2025-05-07T20:26:26.3404141Z #define __cpp_lib_is_final 201402L 2025-05-07T20:26:26.3404445Z #define _GLIBCXX_BEGIN_NAMESPACE_CONTAINER 2025-05-07T20:26:26.3404781Z #define LONG_LONG_MIN (-LONG_LONG_MAX - 1LL) 2025-05-07T20:26:26.3405095Z #define __MATH_DECLARE_LDOUBLE 1 2025-05-07T20:26:26.3405383Z #define __managed__ __location__(managed) 2025-05-07T20:26:26.3405686Z #define _POSIX2_EXPR_NEST_MAX 32 2025-05-07T20:26:26.3406094Z #define __GNUC_PREREQ(maj,min) ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:26.3406521Z #define _POSIX_STREAM_MAX 8 2025-05-07T20:26:26.3406778Z #define __LIBRARY_TYPES_H__ 2025-05-07T20:26:26.3407159Z #define _GLIBCXX_END_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_END_NAMESPACE_CXX11 2025-05-07T20:26:26.3407567Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:26:26.3407828Z #define _SYS_SIZE_T_H 2025-05-07T20:26:26.3408119Z #define _PSTL_VERSION_MINOR ((_PSTL_VERSION % 1000) / 10) 2025-05-07T20:26:26.3408464Z #define _GLIBCXX_STDLIB_H 1 2025-05-07T20:26:26.3408755Z #define isupper_l(c,l) __isupper_l ((c), (l)) 2025-05-07T20:26:26.3409048Z #define _CRTIMP 2025-05-07T20:26:26.3409282Z #define _GLIBCXX_CXX_CONFIG_H 1 2025-05-07T20:26:26.3409599Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:26.3409927Z #define STA_PPSJITTER 0x0200 2025-05-07T20:26:26.3410292Z #define _IO_feof_unlocked(__fp) (((__fp)->_flags & _IO_EOF_SEEN) != 0) 2025-05-07T20:26:26.3410715Z #define __SUSECONDS_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:26.3411034Z #define _GLIBCXX_HAVE_ISINFF 1 2025-05-07T20:26:26.3411327Z #define __glibcxx_requires_subscript(_N) 2025-05-07T20:26:26.3411624Z #define __SIZE_T__ 2025-05-07T20:26:26.3411845Z #define __stub_gtty 2025-05-07T20:26:26.3412076Z #define __pid_t_defined 2025-05-07T20:26:26.3412343Z #define _GLIBCXX_FWDREF(_Tp) _Tp&& 2025-05-07T20:26:26.3412650Z #define __NLINK_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:26.3412969Z #define __glibcxx_function_requires(...) 
2025-05-07T20:26:26.3413270Z #define __SM_80_RT_HPP__ 2025-05-07T20:26:26.3413526Z #define __need_clockid_t 2025-05-07T20:26:26.3413799Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:26:26.3414089Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:26:26.3414521Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:26:26.3414842Z #define _IO_HEX 0100 2025-05-07T20:26:26.3415112Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:26:26.3415551Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:26:26.3415861Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:26:26.3416147Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:26:26.3416561Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:26.3417011Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:26:26.3417327Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:26:26.3417631Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:26:26.3417740Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:26:26.3417852Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:26:26.3417937Z #define __stub_sstk 2025-05-07T20:26:26.3418033Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:26:26.3418306Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:26:26.3418388Z #define __wur 2025-05-07T20:26:26.3418509Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:26:26.3418603Z #define _G_HAVE_MMAP 1 2025-05-07T20:26:26.3418693Z #define _IO_OCT 040 2025-05-07T20:26:26.3418796Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:26:26.3418888Z #define NL_MSGMAX INT_MAX 2025-05-07T20:26:26.3418980Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:26:26.3419117Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:26:26.3419211Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:26:26.3419315Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:26:26.3419514Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:26:26.3419611Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:26:26.3419703Z #define _STL_ALGOBASE_H 1 2025-05-07T20:26:26.3419819Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:26:26.3419910Z #define __off64_t_defined 2025-05-07T20:26:26.3420011Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:26:26.3420114Z #define __FLT128_DIG__ 33 2025-05-07T20:26:26.3420222Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:26:26.3420325Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:26:26.3420410Z #define __INT32_C(c) c 2025-05-07T20:26:26.3420512Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:26:26.3420616Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:26:26.3420713Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:26:26.3420805Z #define __PDP_ENDIAN 3412 2025-05-07T20:26:26.3420899Z #define _ISOC95_SOURCE 1 2025-05-07T20:26:26.3420996Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:26:26.3421128Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:26:26.3421236Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:26:26.3421327Z #define __SM_90_RT_HPP__ 2025-05-07T20:26:26.3421430Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:26:26.3421532Z #define __have_pthread_attr_t 1 2025-05-07T20:26:26.3421635Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:26:26.3421872Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:26:26.3421982Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:26:26.3422086Z #define __cudaCDP2EventRecord 2025-05-07T20:26:26.3422192Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:26:26.3422278Z #define 
htole32(x) (x) 2025-05-07T20:26:26.3422535Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:26:26.3422667Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:26:26.3422767Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:26:26.3422927Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:26:26.3423074Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:26:26.3423203Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:26:26.3423354Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:26:26.3423451Z #define ADJ_OFFSET 0x0001 2025-05-07T20:26:26.3423554Z #define cudaArrayLayered 0x01 2025-05-07T20:26:26.3423829Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:26:26.3423965Z #define cudaEventRecordDefault 0x00 2025-05-07T20:26:26.3424071Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:26:26.3424198Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:26:26.3424357Z #define unix 1 2025-05-07T20:26:26.3424452Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:26:26.3424556Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:26:26.3424653Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:26:26.3424774Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:26:26.3424871Z #define __USE_POSIX 1 2025-05-07T20:26:26.3424968Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:26:26.3425111Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:26:26.3425205Z #define __THROWNL throw () 2025-05-07T20:26:26.3425298Z #define __cpp_rtti 199711L 2025-05-07T20:26:26.3425412Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:26:26.3425502Z #define __PMT(args) args 2025-05-07T20:26:26.3425617Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:26.3425782Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:26:26.3425900Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:26:26.3425992Z #define _SIZE_T_DECLARED 2025-05-07T20:26:26.3426102Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:26:26.3426197Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:26:26.3426603Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:26:26.3426706Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:26:26.3426802Z #define XATTR_LIST_MAX 65536 2025-05-07T20:26:26.3426905Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:26:26.3427051Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:26:26.3427135Z #define _WCHAR_T_H 2025-05-07T20:26:26.3427235Z #define __FLT64X_DIG__ 18 2025-05-07T20:26:26.3427327Z #define _IO_SHOWBASE 0200 2025-05-07T20:26:26.3427416Z #define _POSIX_QLIMIT 1 2025-05-07T20:26:26.3427523Z #define __INT8_TYPE__ signed char 2025-05-07T20:26:26.3427625Z #define __SURFACE_TYPES_H__ 2025-05-07T20:26:26.3427715Z #define __CUDA_ARCH__ 520 2025-05-07T20:26:26.3427830Z #define __cpp_digit_separators 201309L 2025-05-07T20:26:26.3427916Z #define __ELF__ 1 2025-05-07T20:26:26.3428023Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:26:26.3428123Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:26:26.3428210Z #define STA_INS 0x0010 2025-05-07T20:26:26.3428315Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:26:26.3428488Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:26:26.3428583Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:26:26.3428686Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:26:26.3428798Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
2025-05-07T20:26:26.3428909Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:26:26.3429014Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:26:26.3429120Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:26:26.3429223Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:26:26.3429384Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:26:26.3429543Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:26:26.3429653Z #define _IO_funlockfile(_fp) 2025-05-07T20:26:26.3429981Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:26.3430111Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:26:26.3430215Z #define __DRIVER_TYPES_H__ 2025-05-07T20:26:26.3430304Z #define __FLT_RADIX__ 2 2025-05-07T20:26:26.3430410Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:26:26.3430584Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:26:26.3430681Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:26:26.3430783Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:26:26.3430889Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:26:26.3430988Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:26:26.3431096Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:26:26.3431293Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:26:26.3431380Z #define WORD_BIT 32 2025-05-07T20:26:26.3431475Z #define _IO_USER_BUF 1 2025-05-07T20:26:26.3431570Z #define __VECTOR_TYPES_H__ 2025-05-07T20:26:26.3431792Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:26.3431910Z #define cudaHostAllocPortable 0x01 2025-05-07T20:26:26.3432013Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:26:26.3432113Z #define __long_double_t long double 2025-05-07T20:26:26.3432215Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:26:26.3432309Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:26:26.3432726Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:26:26.3432833Z #define __k8 1 2025-05-07T20:26:26.3433034Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:26:26.3433212Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:26:26.3433339Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:26:26.3433441Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:26:26.3433551Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:26:26.3433655Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:26:26.3433759Z #define __blksize_t_defined 2025-05-07T20:26:26.3433859Z #define _IO_SHOWPOINT 0400 2025-05-07T20:26:26.3433961Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:26:26.3434080Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:26:26.3434182Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:26:26.3434291Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:26:26.3434394Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:26:26.3434492Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:26:26.3434748Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:26:26.3435099Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:26:26.3435204Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:26:26.3435307Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:26:26.3435398Z #define SEEK_SET 0 2025-05-07T20:26:26.3435498Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:26:26.3435598Z #define 
__CUDA_API_VER_MINOR__ 6 2025-05-07T20:26:26.3435803Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:26:26.3435909Z #define _GLIBCXX20_DEPRECATED(MSG) 2025-05-07T20:26:26.3436020Z #define __cudaCDP2GetLastError 2025-05-07T20:26:26.3436116Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:26:26.3436209Z #define _MATH_H_MATHDEF 1 2025-05-07T20:26:26.3436539Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:26:26.3436639Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:26:26.3436739Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:26:26.3436839Z #define __stub_sigreturn 2025-05-07T20:26:26.3437085Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:26:26.3437184Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:26:26.3437284Z #define __HOST_CONFIG_H__ 2025-05-07T20:26:26.3437385Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:26:26.3437482Z #define CLOCK_TAI 11 2025-05-07T20:26:26.3437592Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:26:26.3437682Z #define __restrict_arr 2025-05-07T20:26:26.3437801Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:26:26.3437945Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:26:26.3438475Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:26:26.3438666Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:26:26.3438751Z #define __USE_MISC 1 2025-05-07T20:26:26.3438863Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:26:26.3439052Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:26:26.3439143Z #define _GCC_LIMITS_H_ 2025-05-07T20:26:26.3439238Z #define __LDBL_DIG__ 18 2025-05-07T20:26:26.3439336Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:26:26.3439517Z #define __malloc_and_calloc_defined 2025-05-07T20:26:26.3439618Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:26:26.3439723Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:26:26.3439806Z #define __x86_64__ 1 2025-05-07T20:26:26.3439898Z #define _SIZE_T_ 2025-05-07T20:26:26.3440768Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:26:26.3440880Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:26:26.3440979Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:26:26.3441101Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:26:26.3441226Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:26:26.3441324Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:26:26.3441440Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:26:26.3441572Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:26:26.3441715Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:26:26.3441818Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:26:26.3442286Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy (__new, __old, __len); })) 
2025-05-07T20:26:26.3442417Z [... preprocessor macro dump elided: several thousand #define lines emitted by the host compiler / nvcc covering glibc, libstdc++, and CUDA runtime headers; notable values include __CUDACC__ 1, __NVCC__ 1, __CUDA_ARCH_LIST__ 520, CUDART_VERSION 12060, _GLIBCXX_RELEASE 11, __GNUC_MINOR__ 4, __GLIBC_MINOR__ 17 ...]
2025-05-07T20:26:26.3733556Z + conda run -n build_binary nvcc --version
2025-05-07T20:26:28.2687066Z nvcc: NVIDIA (R) Cuda compiler driver
2025-05-07T20:26:28.2687746Z Copyright (c) 2005-2024 NVIDIA Corporation
2025-05-07T20:26:28.2688473Z Built on Tue_Oct_29_23:50:19_PDT_2024
2025-05-07T20:26:28.2688861Z Cuda compilation tools, release 12.6, V12.6.85
2025-05-07T20:26:28.2689333Z Build cuda_12.6.r12.6/compiler.35059454_0
2025-05-07T20:26:28.3326235Z /usr/bin/nvidia-smi
2025-05-07T20:26:28.3330321Z + nvidia-smi
2025-05-07T20:26:28.3504656Z Wed May 7 20:26:28 2025
2025-05-07T20:26:28.3505295Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:28.3506140Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:26:28.3507069Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:26:28.3507963Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:26:28.3508814Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:26:28.3509639Z |                                         |                        |               MIG M. |
2025-05-07T20:26:28.3510228Z |=========================================+========================+======================|
2025-05-07T20:26:28.3672465Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:26:28.3673290Z |  0%   27C    P8             16W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:26:28.3673912Z |                                         |                        |                  N/A |
2025-05-07T20:26:28.3674569Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:26:28.3678058Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:28.3678783Z | Processes:                                                                              |
2025-05-07T20:26:28.3679685Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:26:28.3680393Z |        ID   ID                                                               Usage      |
2025-05-07T20:26:28.3680911Z |=========================================================================================|
2025-05-07T20:26:28.3683926Z |  No running processes found                                                             |
2025-05-07T20:26:28.3684787Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:28.6071004Z [INSTALL] Successfully installed CUDA 12.6.3
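[NOTE] Taken together, the two checks above confirm the toolchain is consistent: the conda-installed toolkit reports release 12.6 (V12.6.85), while nvidia-smi reports driver 570.133.07 whose maximum supported CUDA version is 12.8, which is >= the 12.6 toolkit, so binaries built against CUDA 12.6 can run on this driver. A minimal stand-alone sketch of the same check (assumes nvcc and nvidia-smi are on PATH):

    nvcc --version | grep -o 'release [0-9.]*'                       # toolkit version
    nvidia-smi --query-gpu=driver_version --format=csv,noheader      # driver version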
2025-05-07T20:26:28.6124953Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:26:28.6125587Z . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:26:28.6138927Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:26:28.6139527Z env:
2025-05-07T20:26:28.6139873Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:26:28.6140237Z BUILD_ENV: build_binary
2025-05-07T20:26:28.6140663Z BUILD_TARGET: genai
2025-05-07T20:26:28.6141016Z BUILD_VARIANT: cuda
2025-05-07T20:26:28.6141353Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:26:28.6141730Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:26:28.6142155Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:26:28.6142611Z ##[endgroup]
2025-05-07T20:26:28.9502167Z ################################################################################
2025-05-07T20:26:28.9502611Z # Install PyTorch (PIP)
2025-05-07T20:26:28.9503037Z #
2025-05-07T20:26:28.9516914Z # [2025-05-07T20:26:28.951Z] + install_pytorch_pip build_binary nightly cuda/12.6.3
2025-05-07T20:26:28.9517410Z ################################################################################
2025-05-07T20:26:28.9545484Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy
2025-05-07T20:26:29.9614352Z Channels:
2025-05-07T20:26:29.9614741Z  - conda-forge
2025-05-07T20:26:29.9615483Z Platform: linux-64
2025-05-07T20:26:33.2497022Z Collecting package metadata (repodata.json): done
2025-05-07T20:26:33.9653459Z Solving environment: done
2025-05-07T20:26:34.1816383Z ## Package Plan ##
2025-05-07T20:26:34.1817197Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:26:34.1817815Z added / updated specs:
2025-05-07T20:26:34.1818277Z  - numpy
2025-05-07T20:26:34.1818697Z The following packages will be downloaded:
2025-05-07T20:26:34.1819120Z     package              |                build |   size | channel
2025-05-07T20:26:34.1819532Z     ---------------------|----------------------|--------|------------
2025-05-07T20:26:34.1820036Z     libblas-3.9.0        | 31_h59b9bed_openblas |  16 KB | conda-forge
2025-05-07T20:26:34.1820577Z     libcblas-3.9.0       | 31_he106b2a_openblas |  16 KB | conda-forge
2025-05-07T20:26:34.1821132Z     libgfortran-15.1.0   | h69a702a_2           |  34 KB | conda-forge
2025-05-07T20:26:34.1821732Z     libgfortran5-15.1.0  | hcea5267_2           | 1.5 MB | conda-forge
2025-05-07T20:26:34.1822249Z     liblapack-3.9.0      | 31_h7ac8fdf_openblas |  16 KB | conda-forge
2025-05-07T20:26:34.1822819Z     libopenblas-0.3.29   | pthreads_h94d23a6_0  | 5.6 MB | conda-forge
2025-05-07T20:26:34.1837826Z     numpy-2.2.5          | py310hefbff90_0      | 7.6 MB | conda-forge
2025-05-07T20:26:34.1838275Z     ------------------------------------------------------------
2025-05-07T20:26:34.1838660Z                                             Total: 14.8 MB
2025-05-07T20:26:34.1839063Z The following NEW packages will be INSTALLED:
2025-05-07T20:26:34.1839599Z   libblas       conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas
2025-05-07T20:26:34.1840170Z   libcblas      conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas
2025-05-07T20:26:34.1840685Z   libgfortran   conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2
2025-05-07T20:26:34.1841211Z   libgfortran5  conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2
2025-05-07T20:26:34.1841970Z   liblapack     conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas
2025-05-07T20:26:34.1842740Z   libopenblas   conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0
2025-05-07T20:26:34.1843923Z   numpy         conda-forge/linux-64::numpy-2.2.5-py310hefbff90_0
2025-05-07T20:26:34.1844659Z Downloading and Extracting Packages: ...working... done [per-package progress-bar redraws and terminal control characters elided; all seven packages reached 100%]
2025-05-07T20:26:35.2345541Z Preparing transaction: done
2025-05-07T20:26:35.4351588Z Verifying transaction: done
2025-05-07T20:26:35.5359506Z Executing transaction: done
2025-05-07T20:26:35.7126535Z ################################################################################
2025-05-07T20:26:35.7126955Z # Install Package From PyTorch PIP: torch
2025-05-07T20:26:35.7127257Z #
2025-05-07T20:26:35.7142071Z # [2025-05-07T20:26:35.713Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.6.3
2025-05-07T20:26:35.7142560Z ################################################################################
2025-05-07T20:26:35.7157658Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:26:35.8104609Z [CHECK] Network does not appear to be blocked.
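[NOTE] The [EXEC] [ATTEMPT i/3] prefix indicates that each external command is run through a retry wrapper defined in setup_env.bash. A minimal bash sketch of that pattern (the function name and backoff here are illustrative, not the script's actual implementation):

    exec_with_retries () {
      local max_attempts=3
      local attempt
      for ((attempt = 0; attempt < max_attempts; attempt++)); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_attempts}] + $*"
        "$@" && return 0   # stop on first success
        sleep 2            # brief pause before retrying
      done
      echo "[EXEC] Command failed after ${max_attempts} attempts: $*"
      return 1
    }

Under this pattern, the numpy step above would be invoked as: exec_with_retries conda install -n build_binary -c conda-forge --override-channels -y numpy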
2025-05-07T20:26:35.8105570Z ################################################################################
2025-05-07T20:26:35.8106303Z # Prepare PIP Arguments (PyTorch PIP)
2025-05-07T20:26:35.8106707Z #
2025-05-07T20:26:35.8121755Z # [2025-05-07T20:26:35.811Z] + __prepare_pip_arguments torch nightly cuda/12.6.3
2025-05-07T20:26:35.8122533Z ################################################################################
2025-05-07T20:26:35.8143888Z [INSTALL] Extracted package (channel, version): (nightly, LATEST)
2025-05-07T20:26:35.8170804Z [INSTALL] Extracted package variant: cu126
2025-05-07T20:26:35.8187931Z [INSTALL] Using a non-RELEASE channel: nightly ...
2025-05-07T20:26:35.8188708Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu126/
2025-05-07T20:26:35.8197413Z [INSTALL] Extracted the full PIP package: --pre torch
2025-05-07T20:26:35.8207092Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu126/ ...
2025-05-07T20:26:35.8229045Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126/
2025-05-07T20:27:55.6194257Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu126/
2025-05-07T20:27:55.6194894Z Collecting torch
2025-05-07T20:27:55.6195575Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (30 kB)
2025-05-07T20:27:55.6196299Z [... dependency resolution elided: pip also collected and downloaded filelock-3.16.1, sympy-1.13.3, networkx-3.4.2, jinja2-3.1.4, fsspec-2024.10.0, MarkupSafe-2.1.5, mpmath-1.3.0, pytorch-triton-3.3.0+git96316ce5, and the nvidia-*-cu12 wheels enumerated in the install summary below (the largest being nvidia-cudnn-cu12 at 571.0 MB and nvidia-cublas-cu12 at 393.1 MB); typing-extensions>=4.10.0 and setuptools>=40.8.0 were already satisfied ...]
2025-05-07T20:27:55.6237751Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp310-cp310-manylinux_2_28_x86_64.whl (825.5 MB)
2025-05-07T20:27:55.6245609Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch
2025-05-07T20:27:55.6249395Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.6.4.1 nvidia-cuda-cupti-cu12-12.6.80 nvidia-cuda-nvrtc-cu12-12.6.77 nvidia-cuda-runtime-cu12-12.6.77 nvidia-cudnn-cu12-9.5.1.17 nvidia-cufft-cu12-11.3.0.4 nvidia-cufile-cu12-1.11.1.6 nvidia-curand-cu12-10.3.7.77 nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu126
2025-05-07T20:27:57.8293526Z torch 2.8.0.dev20250507+cu126
2025-05-07T20:27:57.8296088Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu126)
2025-05-07T20:28:01.2755338Z [CHECK] Python (sub-)package 'torch.distributed' found ...
2025-05-07T20:28:04.7150224Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu126
2025-05-07T20:28:04.7150786Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ...
2025-05-07T20:28:08.0891158Z True
2025-05-07T20:28:08.0891502Z True
2025-05-07T20:28:08.1523087Z [INSTALL] Successfully installed PyTorch through PyTorch PIP
2025-05-07T20:28:08.1561889Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi
2025-05-07T20:28:08.1562514Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi
2025-05-07T20:28:08.1575507Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:28:08.1575860Z env:
2025-05-07T20:28:08.1576091Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:28:08.1576394Z BUILD_ENV: build_binary
2025-05-07T20:28:08.1576648Z BUILD_TARGET: genai
2025-05-07T20:28:08.1576894Z BUILD_VARIANT: cuda
2025-05-07T20:28:08.1577135Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:28:08.1577391Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:28:08.1577824Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:28:08.1578257Z ##[endgroup]
2025-05-07T20:28:08.4945569Z /home/ec2-user/miniconda/bin/conda
2025-05-07T20:28:08.4947422Z ################################################################################
2025-05-07T20:28:08.4947903Z # Collect PyTorch Environment Information (for Reporting Issues)
2025-05-07T20:28:08.4948283Z #
2025-05-07T20:28:08.4964277Z # [2025-05-07T20:28:08.496Z] + collect_pytorch_env_info build_binary
2025-05-07T20:28:08.4964681Z ################################################################################
2025-05-07T20:28:08.4981617Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:28:08.5916898Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:28:08.5927429Z [INFO] Downloading the PyTorch environment info collection script ...
2025-05-07T20:28:08.5928066Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
2025-05-07T20:28:08.6819295Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ...
2025-05-07T20:28:08.6841656Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py
2025-05-07T20:28:14.6380002Z Collecting environment information...
2025-05-07T20:28:14.6380383Z PyTorch version: 2.8.0.dev20250507+cu126 2025-05-07T20:28:14.6380718Z Is debug build: False 2025-05-07T20:28:14.6380977Z CUDA used to build PyTorch: 12.6 2025-05-07T20:28:14.6381270Z ROCM used to build PyTorch: N/A 2025-05-07T20:28:14.6381478Z 2025-05-07T20:28:14.6381602Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:28:14.6381925Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:28:14.6382251Z Clang version: Could not collect 2025-05-07T20:28:14.6382547Z CMake version: Could not collect 2025-05-07T20:28:14.6382815Z Libc version: glibc-2.34 2025-05-07T20:28:14.6382980Z 2025-05-07T20:28:14.6383305Z Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] (64-bit runtime) 2025-05-07T20:28:14.6383938Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:28:14.6384369Z Is CUDA available: True 2025-05-07T20:28:14.6384623Z CUDA runtime version: 12.6.85 2025-05-07T20:28:14.6384900Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:28:14.6385215Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:28:14.6385547Z Nvidia driver version: 570.133.07 2025-05-07T20:28:14.6395427Z cuDNN version: Could not collect 2025-05-07T20:28:14.6395768Z HIP runtime version: N/A 2025-05-07T20:28:14.6396023Z MIOpen runtime version: N/A 2025-05-07T20:28:14.6396293Z Is XNNPACK available: True 2025-05-07T20:28:14.6396460Z 2025-05-07T20:28:14.6396550Z CPU: 2025-05-07T20:28:14.6396767Z Architecture: x86_64 2025-05-07T20:28:14.6397113Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:28:14.6397511Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:28:14.6397895Z Byte Order: Little Endian 2025-05-07T20:28:14.6398215Z CPU(s): 16 2025-05-07T20:28:14.6398515Z On-line CPU(s) list: 0-15 2025-05-07T20:28:14.6399148Z Vendor ID: AuthenticAMD 2025-05-07T20:28:14.6399494Z Model name: AMD EPYC 7R32 2025-05-07T20:28:14.6399813Z CPU family: 23 2025-05-07T20:28:14.6400100Z Model: 49 2025-05-07T20:28:14.6400382Z Thread(s) per core: 2 2025-05-07T20:28:14.6400677Z Core(s) per socket: 8 2025-05-07T20:28:14.6400961Z Socket(s): 1 2025-05-07T20:28:14.6401296Z Stepping: 0 2025-05-07T20:28:14.6401623Z BogoMIPS: 5599.99 2025-05-07T20:28:14.6403744Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:28:14.6405884Z Hypervisor vendor: KVM 2025-05-07T20:28:14.6406197Z Virtualization type: full 2025-05-07T20:28:14.6406540Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:28:14.6406893Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:28:14.6407405Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:28:14.6407760Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:28:14.6408075Z NUMA node(s): 1 2025-05-07T20:28:14.6408373Z NUMA node0 CPU(s): 0-15 2025-05-07T20:28:14.6408710Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:28:14.6409092Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:28:14.6409448Z Vulnerability L1tf: Not affected 2025-05-07T20:28:14.6409800Z 
Vulnerability Mds: Not affected 2025-05-07T20:28:14.6410154Z Vulnerability Meltdown: Not affected 2025-05-07T20:28:14.6410505Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:28:14.6410875Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:28:14.6411426Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:28:14.6412011Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:28:14.6412559Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:28:14.6413254Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:28:14.6414125Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:28:14.6414808Z Vulnerability Srbds: Not affected 2025-05-07T20:28:14.6415166Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:28:14.6415404Z 2025-05-07T20:28:14.6415509Z Versions of relevant libraries: 2025-05-07T20:28:14.6415779Z [pip3] numpy==2.2.5 2025-05-07T20:28:14.6416023Z [pip3] nvidia-cublas-cu12==12.6.4.1 2025-05-07T20:28:14.6416330Z [pip3] nvidia-cuda-cupti-cu12==12.6.80 2025-05-07T20:28:14.6416652Z [pip3] nvidia-cuda-nvrtc-cu12==12.6.77 2025-05-07T20:28:14.6416962Z [pip3] nvidia-cuda-runtime-cu12==12.6.77 2025-05-07T20:28:14.6417279Z [pip3] nvidia-cudnn-cu12==9.5.1.17 2025-05-07T20:28:14.6417574Z [pip3] nvidia-cufft-cu12==11.3.0.4 2025-05-07T20:28:14.6417859Z [pip3] nvidia-curand-cu12==10.3.7.77 2025-05-07T20:28:14.6418259Z [pip3] nvidia-cusolver-cu12==11.7.1.2 2025-05-07T20:28:14.6418576Z [pip3] nvidia-cusparse-cu12==12.5.4.2 2025-05-07T20:28:14.6418998Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:28:14.6419305Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:28:14.6419594Z [pip3] nvidia-nvjitlink-cu12==12.6.85 2025-05-07T20:28:14.6419889Z [pip3] nvidia-nvtx-cu12==12.6.77 2025-05-07T20:28:14.6420177Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:28:14.6420479Z [pip3] torch==2.8.0.dev20250507+cu126 2025-05-07T20:28:14.6420856Z [conda] cuda-cudart 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.6421340Z [conda] cuda-cudart-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.6421855Z [conda] cuda-cudart-dev_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:14.6422381Z [conda] cuda-cudart-static 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.6422910Z [conda] cuda-cudart-static_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:14.6423442Z [conda] cuda-cudart_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:14.6423921Z [conda] cuda-cupti 12.6.80 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.6424393Z [conda] cuda-cupti-dev 12.6.80 h5888daf_0 conda-forge 2025-05-07T20:28:14.6424879Z [conda] cuda-libraries 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:14.6425372Z [conda] cuda-libraries-dev 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:14.6425842Z [conda] cuda-nvrtc 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.6426400Z [conda] cuda-nvrtc-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:14.6426859Z [conda] cuda-nvtx 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.6427306Z [conda] cuda-opencl 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.6427781Z [conda] cuda-opencl-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.6428263Z [conda] cuda-runtime 12.6.3 ha804496_0 conda-forge 2025-05-07T20:28:14.6428722Z [conda] libcublas 12.6.4.1 h5888daf_1 
conda-forge 2025-05-07T20:28:14.6429181Z [conda] libcublas-dev 12.6.4.1 h5888daf_1 conda-forge 2025-05-07T20:28:14.6429646Z [conda] libcufft 11.3.0.4 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.6430102Z [conda] libcufft-dev 11.3.0.4 h5888daf_0 conda-forge 2025-05-07T20:28:14.6430564Z [conda] libcurand 10.3.7.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.6431030Z [conda] libcurand-dev 10.3.7.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.6431502Z [conda] libcusolver 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:14.6431985Z [conda] libcusolver-dev 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:14.6432465Z [conda] libcusparse 12.5.4.2 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.6432946Z [conda] libcusparse-dev 12.5.4.2 h5888daf_0 conda-forge 2025-05-07T20:28:14.6433430Z [conda] libnvjitlink 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.6433913Z [conda] libnvjitlink-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:14.6434370Z [conda] numpy 2.2.5 py310hefbff90_0 conda-forge 2025-05-07T20:28:14.6434831Z [conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi 2025-05-07T20:28:14.6435336Z [conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi 2025-05-07T20:28:14.6435829Z [conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:14.6436331Z [conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:14.6436829Z [conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi 2025-05-07T20:28:14.6437392Z [conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi 2025-05-07T20:28:14.6437861Z [conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi 2025-05-07T20:28:14.6438348Z [conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi 2025-05-07T20:28:14.6438836Z [conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi 2025-05-07T20:28:14.6439325Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:28:14.6439812Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:28:14.6440296Z [conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi 2025-05-07T20:28:14.6440777Z [conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:14.6441249Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:28:14.6441765Z [conda] torch 2.8.0.dev20250507+cu126 pypi_0 pypi 2025-05-07T20:28:14.6442035Z 2025-05-07T20:28:14.7086723Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:14.7087431Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:14.7099504Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:14.7099861Z env: 2025-05-07T20:28:14.7100094Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:14.7100426Z BUILD_ENV: build_binary 2025-05-07T20:28:14.7100681Z BUILD_TARGET: genai 2025-05-07T20:28:14.7101111Z BUILD_VARIANT: cuda 2025-05-07T20:28:14.7101353Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:14.7101622Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:14.7101933Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:14.7102273Z ##[endgroup] 2025-05-07T20:28:15.0498127Z ################################################################################ 2025-05-07T20:28:15.0498576Z # Prepare FBGEMM-GPU Build 2025-05-07T20:28:15.0498833Z # 2025-05-07T20:28:15.0514627Z # [2025-05-07T20:28:15.051Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:28:15.0515053Z ################################################################################ 2025-05-07T20:28:15.0515274Z 2025-05-07T20:28:15.0531270Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:15.1464528Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:15.1488765Z [BUILD] Running git submodules update ... 2025-05-07T20:28:15.1512958Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:28:15.1876056Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:28:15.1876762Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:28:15.1877409Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:28:15.1877817Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:28:15.1878233Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:28:15.1878689Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:28:15.1879105Z Synchronizing submodule url for '../external/json' 2025-05-07T20:28:15.1909597Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:28:15.2461242Z [BUILD] Installing other build dependencies ... 
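Note: the [EXEC] [ATTEMPT 0/3] prefix used throughout this log indicates commands are run through a retry helper in setup_env.bash. A minimal sketch of such a wrapper follows; the function name, retry count, and backoff here are illustrative assumptions, not the actual implementation:

    exec_with_retries () {
      # Retry a command up to 3 times, echoing an attempt counter in the
      # same "[EXEC] [ATTEMPT n/3]" style seen in this log.
      local max_retries=3 attempt
      for (( attempt = 0; attempt <= max_retries; attempt++ )); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_retries}] + $*"
        "$@" && return 0
        sleep 2
      done
      echo "[EXEC] Command failed after ${max_retries} retries: $*" >&2
      return 1
    }

    # e.g. the dependency install that follows in the log:
    exec_with_retries conda run --no-capture-output -n build_binary \
      python -m pip install -r requirements.txt
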
2025-05-07T20:28:15.2482147Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:28:17.6526029Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:28:17.6701096Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:28:17.7717798Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:28:17.7744727Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:28:18.0236519Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:28:18.0267760Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:28:18.1285819Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:28:18.1319527Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:28:18.4768683Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:28:18.4802826Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:28:18.5362168Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:28:18.5365781Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:28:18.6152263Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:28:18.6179342Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:28:18.6628993Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:28:18.7234913Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:28:18.7262238Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:28:18.8414110Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:28:18.8442913Z Downloading PyYAML-6.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:28:18.9417091Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:28:18.9453018Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:28:18.9985527Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:28:19.0607684Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:28:19.0640356Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:28:19.1587707Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:28:19.1612419Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:28:19.2638389Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:28:19.2667132Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:28:19.3749253Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:19.3777610Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:28:19.4720155Z Collecting pyproject_hooks (from build->-r requirements.txt 
(line 14)) 2025-05-07T20:28:19.4749415Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:28:19.5821470Z Collecting tomli>=1.1.0 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:19.5850629Z Downloading tomli-2.2.1-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:19.6854143Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:19.6887400Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:19.8180804Z Collecting exceptiongroup>=1.0.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:19.8208340Z Downloading exceptiongroup-1.2.2-py3-none-any.whl.metadata (6.6 kB) 2025-05-07T20:28:19.9172162Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:19.9319385Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:19.9809311Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:28:20.0359333Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:20.0385756Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:28:20.0867595Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:28:20.1408503Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:28:20.1438724Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:28:20.1904098Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:28:20.2562751Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:20.2589256Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:28:20.3096389Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:28:20.3571109Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:28:20.4058991Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:28:20.9045579Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 55.8 MB/s eta 0:00:00 2025-05-07T20:28:20.9075858Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:28:20.9554788Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:28:21.0111825Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:28:21.0539850Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:28:21.1156569Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:28:21.1612860Z Downloading PyYAML-6.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (751 kB) 2025-05-07T20:28:21.2233452Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 751.2/751.2 kB 8.2 MB/s eta 0:00:00 2025-05-07T20:28:21.2283167Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:28:21.2738601Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:21.3230748Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:28:21.3748161Z 
Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:28:21.4265349Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:28:21.4720813Z Downloading exceptiongroup-1.2.2-py3-none-any.whl (16 kB) 2025-05-07T20:28:21.5196188Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:28:21.5683980Z Downloading tomli-2.2.1-py3-none-any.whl (14 kB) 2025-05-07T20:28:21.6230053Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:28:21.6750006Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:21.7273473Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:28:21.7757719Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:28:22.0048477Z Installing collected packages: sortedcontainers, tomli, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, exceptiongroup, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:28:24.5254665Z 2025-05-07T20:28:24.5325506Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 exceptiongroup-1.2.2 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 tomli-2.2.1 typing-inspect-0.9.0 2025-05-07T20:28:24.7103747Z ################################################################################ 2025-05-07T20:28:24.7104494Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:28:24.7104763Z # 2025-05-07T20:28:24.7119881Z # [2025-05-07T20:28:24.711Z] + install_triton_pip build_binary 2025-05-07T20:28:24.7120271Z ################################################################################ 2025-05-07T20:28:24.7120487Z 2025-05-07T20:28:24.7120715Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:28:24.7121153Z ################################################################################ 2025-05-07T20:28:24.7121519Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:28:24.7121851Z # 2025-05-07T20:28:24.7136511Z # [2025-05-07T20:28:24.713Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:24.7137042Z ################################################################################ 2025-05-07T20:28:24.7137261Z 2025-05-07T20:28:24.7151985Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:24.8070483Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:24.8071070Z ################################################################################ 2025-05-07T20:28:24.8071425Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:28:24.8071717Z # 2025-05-07T20:28:24.8088952Z # [2025-05-07T20:28:24.808Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:24.8089445Z ################################################################################ 2025-05-07T20:28:24.8089666Z 2025-05-07T20:28:24.8136147Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:28:24.8152799Z [INSTALL] Using a non-RELEASE channel: nightly ... 
2025-05-07T20:28:24.8153324Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:24.8161918Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:24.8171130Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:28:24.8192190Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:32.4891601Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. 2025-05-07T20:28:32.4893099Z torch 2.8.0.dev20250507+cu126 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux" and platform_machine == "x86_64", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:28:32.4893853Z 2025-05-07T20:28:32.4894071Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:32.4894483Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:32.4895286Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:28:32.4896515Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:28:32.4897613Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 55.8 MB/s eta 0:00:00 2025-05-07T20:28:32.4897991Z Installing collected packages: pytorch-triton 2025-05-07T20:28:32.4898428Z Attempting uninstall: pytorch-triton 2025-05-07T20:28:32.4898824Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:28:32.4899250Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:28:32.4899672Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:28:32.4900113Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:28:32.4900372Z 2025-05-07T20:28:34.6804421Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:28:34.6809224Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:28:36.8235360Z ################################################################################ 2025-05-07T20:28:36.8235860Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:28:36.8236254Z ################################################################################ 2025-05-07T20:28:36.8236475Z 2025-05-07T20:28:38.8643149Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:28:40.9758048Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:28:40.9762052Z [BUILD] Successfully ran git submodules update 2025-05-07T20:28:40.9819896Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:40.9820388Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:40.9832939Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:40.9833292Z env: 2025-05-07T20:28:40.9833527Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:40.9833832Z BUILD_ENV: build_binary 2025-05-07T20:28:40.9834087Z BUILD_TARGET: genai 2025-05-07T20:28:40.9834327Z BUILD_VARIANT: cuda 2025-05-07T20:28:40.9834571Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:40.9834830Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:40.9835141Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:40.9835487Z ##[endgroup] 2025-05-07T20:28:41.3225628Z ################################################################################ 2025-05-07T20:28:41.3226151Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:28:41.3226520Z # 2025-05-07T20:28:41.3243274Z # [2025-05-07T20:28:41.323Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.3244244Z ################################################################################ 2025-05-07T20:28:41.3244460Z 2025-05-07T20:28:41.3244840Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.3245765Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.3246109Z 2025-05-07T20:28:41.3365181Z 8ba3834acd41ae3bcccd6bc3808c6265641c1772 fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.3367906Z 2025-05-07T20:28:41.3368387Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.3368743Z 2025-05-07T20:28:41.3503510Z f663bfaa41e2d494994aba32b6056b326d1d4de603cd7405849022b0c68c5a6f fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.3506029Z 2025-05-07T20:28:41.3506522Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.3506876Z 2025-05-07T20:28:41.3735314Z b3c9062203e47ff2273663b0f7d0fbee fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.3737959Z 2025-05-07T20:28:41.3747701Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl ... 2025-05-07T20:28:41.3770100Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:28:44.1094953Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:28:44.1095931Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:28:44.1096797Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:28:44.1097254Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:28:44.1097534Z 2025-05-07T20:28:51.0088645Z ################################################################################ 2025-05-07T20:28:51.0089070Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:28:51.0089468Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu126
2025-05-07T20:28:51.0089906Z [CHECK] CUDA version reported by PyTorch is: 12.6
2025-05-07T20:28:51.0090223Z [CHECK]
2025-05-07T20:28:51.0090559Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU
2025-05-07T20:28:51.0091102Z [CHECK] package channel, the package may be broken at runtime!!!
2025-05-07T20:28:51.0091527Z ################################################################################
2025-05-07T20:28:51.0091739Z
2025-05-07T20:28:51.0091859Z [INSTALL] Checking imports and symbols ...
2025-05-07T20:28:54.9418621Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:28:58.8634798Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'.
2025-05-07T20:29:02.7912256Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'.
2025-05-07T20:29:02.7915430Z [CHECK] Printing out the FBGEMM-GPU version ...
2025-05-07T20:29:14.5766382Z ################################################################################
2025-05-07T20:29:14.5768871Z [CHECK] The installed FBGEMM TARGET is: genai
2025-05-07T20:29:14.5769479Z [CHECK] The installed FBGEMM VARIANT is: cuda
2025-05-07T20:29:14.5769947Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7
2025-05-07T20:29:14.5770316Z ################################################################################
2025-05-07T20:29:14.5770535Z
2025-05-07T20:29:22.4328151Z ################################################################################
2025-05-07T20:29:22.4328564Z [CHECK] FBGEMM_GPU Experimental Packages
2025-05-07T20:29:22.4329963Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils']
2025-05-07T20:29:22.4331908Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__']
2025-05-07T20:29:22.4332430Z ################################################################################
2025-05-07T20:29:22.4332656Z
2025-05-07T20:29:22.4332815Z [INSTALL] Check for installation of Python sources ...
2025-05-07T20:29:26.3632913Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ...
2025-05-07T20:29:30.2915413Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ...
2025-05-07T20:29:34.3705737Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ...
2025-05-07T20:29:38.3008407Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ...
2025-05-07T20:29:38.3012694Z [INSTALL] Check for operator registrations ...
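Note: the registration check that follows amounts to looking each operator name up on torch.ops after importing fbgemm_gpu. A minimal reproduction is sketched here (an assumption about the check's shape, not the exact setup_env.bash helper):

    conda run -n build_binary python -c "
    import torch
    import fbgemm_gpu  # importing fbgemm_gpu loads the shared libraries that register the ops
    # attribute lookup on torch.ops raises if the operator was never registered
    print(torch.ops.fbgemm.nccl_init)
    "
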
2025-05-07T20:29:42.1686239Z fbgemm.nccl_init 2025-05-07T20:29:42.1686425Z 2025-05-07T20:29:42.2299613Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:29:46.0877235Z fbgemm.gqa_attn_splitk 2025-05-07T20:29:46.0877528Z 2025-05-07T20:29:46.1494019Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:29:50.0223229Z fbgemm.rope_qkv_decoding 2025-05-07T20:29:50.0223546Z 2025-05-07T20:29:50.0838844Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:29:50.0839480Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:29:50.0874662Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:29:50.0875135Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:29:50.0887379Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:50.0887734Z env: 2025-05-07T20:29:50.0887965Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:50.0888268Z BUILD_ENV: build_binary 2025-05-07T20:29:50.0888520Z BUILD_TARGET: genai 2025-05-07T20:29:50.0888756Z BUILD_VARIANT: cuda 2025-05-07T20:29:50.0888990Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:29:50.0889251Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:50.0889559Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:50.0889889Z ##[endgroup] 2025-05-07T20:29:50.4233060Z ################################################################################ 2025-05-07T20:29:50.4233442Z # Test All FBGEMM-GPU Modules 2025-05-07T20:29:50.4233699Z # 2025-05-07T20:29:50.4248412Z # [2025-05-07T20:29:50.424Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:29:50.4248823Z ################################################################################ 2025-05-07T20:29:50.4249049Z 2025-05-07T20:29:58.2711909Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:29:58.2712689Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:29:58.2713099Z [TEST] Determined the test directories: 2025-05-07T20:29:58.2713424Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:29:58.2713729Z fbgemm_gpu/experimental/example/test 2025-05-07T20:29:58.2714038Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:29:58.2714238Z 2025-05-07T20:29:58.2722013Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:29:58.2729200Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:29:58.2729817Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:29:58.2730226Z 2025-05-07T20:29:58.6959671Z 2025-05-07T20:29:58.6960075Z [TEST] Installing PyTest ... 
2025-05-07T20:29:58.6982416Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest
2025-05-07T20:29:59.8027845Z Channels:
2025-05-07T20:29:59.8028101Z  - conda-forge
2025-05-07T20:29:59.8028336Z Platform: linux-64
2025-05-07T20:30:03.1365898Z Collecting package metadata (repodata.json): done
2025-05-07T20:30:04.2789409Z Solving environment: done
2025-05-07T20:30:04.5055116Z
2025-05-07T20:30:04.5055514Z ## Package Plan ##
2025-05-07T20:30:04.5056326Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:30:04.5056886Z   added / updated specs:
2025-05-07T20:30:04.5057211Z     - expecttest
2025-05-07T20:30:04.5057430Z     - pytest
2025-05-07T20:30:04.5057694Z The following packages will be downloaded:
2025-05-07T20:30:04.5058060Z     package                    |            build
2025-05-07T20:30:04.5058492Z     ---------------------------|-----------------
2025-05-07T20:30:04.5058927Z     colorama-0.4.6             |    pyhd8ed1ab_1        26 KB  conda-forge
2025-05-07T20:30:04.5059608Z     exceptiongroup-1.2.2       |    pyhd8ed1ab_1        20 KB  conda-forge
2025-05-07T20:30:04.5060249Z     expecttest-0.3.0           |    pyhd8ed1ab_0        14 KB  conda-forge
2025-05-07T20:30:04.5060701Z     iniconfig-2.0.0            |    pyhd8ed1ab_1        11 KB  conda-forge
2025-05-07T20:30:04.5061136Z     packaging-25.0             |    pyh29332c3_1        61 KB  conda-forge
2025-05-07T20:30:04.5061565Z     pluggy-1.5.0               |    pyhd8ed1ab_1        23 KB  conda-forge
2025-05-07T20:30:04.5061984Z     pytest-8.3.5               |    pyhd8ed1ab_0       254 KB  conda-forge
2025-05-07T20:30:04.5062761Z     tomli-2.2.1                |    pyhd8ed1ab_1        19 KB  conda-forge
2025-05-07T20:30:04.5063158Z     ------------------------------------------------------------
2025-05-07T20:30:04.5063502Z                                            Total:       428 KB
2025-05-07T20:30:04.5063856Z The following NEW packages will be INSTALLED:
2025-05-07T20:30:04.5064298Z   colorama        conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1
2025-05-07T20:30:04.5064813Z   exceptiongroup  conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1
2025-05-07T20:30:04.5065342Z   expecttest      conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0
2025-05-07T20:30:04.5065829Z   iniconfig       conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1
2025-05-07T20:30:04.5066302Z   packaging       conda-forge/noarch::packaging-25.0-pyh29332c3_1
2025-05-07T20:30:04.5066929Z   pluggy          conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1
2025-05-07T20:30:04.5067574Z   pytest          conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0
2025-05-07T20:30:04.5068165Z   tomli           conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1
2025-05-07T20:30:04.5068595Z Downloading and Extracting Packages: all 8 packages reached 100% (interactive progress-bar output elided) ... done
2025-05-07T20:30:05.0715475Z Preparing transaction: done
2025-05-07T20:30:05.1720852Z Verifying transaction: done
2025-05-07T20:30:07.0749049Z Executing transaction: done
2025-05-07T20:30:07.2014502Z [TEST] Checking imports ...
2025-05-07T20:30:11.1154575Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:30:11.1168605Z [TEST] Setting feature flags ...
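Note: the command that follows persists the flag with conda env config vars so it is exported into the environment and visible to every subsequent conda run subprocess, rather than only to the current job shell. One way to confirm a flag took effect (a sketch):

    conda run -n build_binary python -c \
      'import os; print(os.environ.get("FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD"))'   # expect: 1
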
2025-05-07T20:30:11.1169375Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:30:11.1169766Z 2025-05-07T20:30:11.5374854Z 2025-05-07T20:30:11.5375180Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:30:11.5377067Z ################################################################################ 2025-05-07T20:30:11.5377811Z # Run FBGEMM-GPU Tests: 2025-05-07T20:30:11.5378397Z # 2025-05-07T20:30:11.5395978Z # [2025-05-07T20:30:11.539Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:30:11.5396521Z ################################################################################ 2025-05-07T20:30:11.5396979Z 2025-05-07T20:30:11.5403774Z [TEST] Enumerating ALL test files ... 2025-05-07T20:30:11.5432405Z ./attention/gqa_test.py 2025-05-07T20:30:11.5432809Z ./coalesce/coalesce_test.py 2025-05-07T20:30:11.5433199Z ./comm/multi_gpu_car_test.py 2025-05-07T20:30:11.5433629Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:30:11.5434007Z ./kv_cache/kv_cache_test.py 2025-05-07T20:30:11.5434263Z ./moe/activation_test.py 2025-05-07T20:30:11.5434621Z ./moe/gather_scatter_test.py 2025-05-07T20:30:11.5435000Z ./moe/layers_test.py 2025-05-07T20:30:11.5435327Z ./moe/shuffling_test.py 2025-05-07T20:30:11.5435679Z ./quantize/quantize_test.py 2025-05-07T20:30:11.5435919Z 2025-05-07T20:30:11.5436081Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:30:11.5436301Z 2025-05-07T20:30:11.5453404Z ################################################################################ 2025-05-07T20:30:11.5468671Z # [2025-05-07T20:30:11.546Z] Run Python Test Suite: 2025-05-07T20:30:11.5469094Z # ./attention/gqa_test.py 2025-05-07T20:30:11.5469416Z ################################################################################ 2025-05-07T20:30:11.5492734Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:30:11.5493357Z 2025-05-07T20:30:14.0870861Z ============================= test session starts ============================== 2025-05-07T20:30:14.0871524Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:14.0872073Z cachedir: .pytest_cache 2025-05-07T20:30:14.0872717Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:14.0873815Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:14.0874243Z plugins: hypothesis-6.131.14 2025-05-07T20:30:15.6206672Z collecting ... 
collected 2 items

2025-05-07T20:30:53.9325215Z attention/gqa_test.py::Int4GQATest::test_gqa
Trying example: test_gqa(int4_kv=False, num_groups=1, B=1, MAX_T=4, N_H_L=1)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=1, MAX_T=4, N_H_L=1)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=23, MAX_T=33, N_H_L=68)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=77, MAX_T=4, N_H_L=1)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=77, MAX_T=52, N_H_L=67)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=57, MAX_T=45, N_H_L=120)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=52, MAX_T=42, N_H_L=53)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=77, MAX_T=95, N_H_L=53)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=113, MAX_T=48, N_H_L=96)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=51, MAX_T=61, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=17, MAX_T=113, N_H_L=65)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=17, MAX_T=65, N_H_L=65)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=65, MAX_T=65, N_H_L=65)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=108, N_H_L=14)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=14, N_H_L=14)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=6, N_H_L=14)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=6, N_H_L=6)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=70, MAX_T=94, N_H_L=78)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=78, MAX_T=94, N_H_L=78)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=94, MAX_T=94, N_H_L=78)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=94, MAX_T=94, N_H_L=94)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=41, MAX_T=105, N_H_L=126)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=105, MAX_T=105, N_H_L=126)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=105, MAX_T=105, N_H_L=105)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=95, MAX_T=114, N_H_L=43)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=43, MAX_T=114, N_H_L=43)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=43, MAX_T=43, N_H_L=43)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=21, MAX_T=38, N_H_L=42)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=38, MAX_T=38, N_H_L=42)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=38, MAX_T=42, N_H_L=42)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=42, MAX_T=42, N_H_L=42)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=74, MAX_T=20, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=20, MAX_T=20, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=20, MAX_T=15, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=15, MAX_T=20, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=15, MAX_T=15, N_H_L=15)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=104, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=117, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=69, MAX_T=117, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=69, N_H_L=69)
2025-05-07T20:30:53.9410685Z PASSED
2025-05-07T20:30:53.9677860Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...)
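Note: the "Trying example" listings above are Hypothesis's Verbosity.verbose output under the derandomized 'ci' profile shown in the session header. A minimal sketch of the pattern that produces this output is below; the class name, body, and strategy bounds are illustrative stand-ins, not FBGEMM's actual test.

import unittest

from hypothesis import Verbosity, given, settings, strategies as st

class Int4GQATestSketch(unittest.TestCase):
    # Each @given parameter is drawn per example; with Verbosity.verbose,
    # Hypothesis prints one "Trying example: ..." block per drawn example.
    @given(
        int4_kv=st.booleans(),
        num_groups=st.sampled_from([1, 4]),
        B=st.integers(min_value=1, max_value=128),
        MAX_T=st.integers(min_value=1, max_value=128),
        N_H_L=st.integers(min_value=1, max_value=128),
    )
    @settings(verbosity=Verbosity.verbose, derandomize=True, deadline=None)
    def test_gqa(self, int4_kv: bool, num_groups: int, B: int, MAX_T: int, N_H_L: int) -> None:
        # Placeholder assertion; the real test checks the GQA kernel output.
        self.assertGreaterEqual(B, 1)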
2025-05-07T20:30:53.9678372Z =========================== short test summary info ============================
2025-05-07T20:30:53.9679097Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/unittest/case.py:117: Skip when CUDA is not available or xformers is not available
2025-05-07T20:30:53.9679800Z ======================== 1 passed, 1 skipped in 40.38s =========================
2025-05-07T20:30:54.6108334Z [TEST] Python test suite PASSED: ./attention/gqa_test.py
2025-05-07T20:30:54.6130129Z [TEST] Python test time for ./attention/gqa_test.py: 43 seconds
2025-05-07T20:30:54.6150818Z ################################################################################
2025-05-07T20:30:54.6166328Z # [2025-05-07T20:30:54.616Z] Run Python Test Suite:
2025-05-07T20:30:54.6166808Z # ./coalesce/coalesce_test.py
2025-05-07T20:30:54.6167117Z ################################################################################
2025-05-07T20:30:54.6192372Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py
2025-05-07T20:30:56.7637852Z ============================= test session starts ==============================
2025-05-07T20:30:56.7638599Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:30:56.7639137Z cachedir: .pytest_cache
2025-05-07T20:30:56.7639736Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:30:56.7640476Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:30:56.7640885Z plugins: hypothesis-6.131.14
2025-05-07T20:30:58.3250053Z collecting ...
collected 1 item
2025-05-07T20:30:59.0547702Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED
2025-05-07T20:30:59.0548214Z ============================== 1 passed in 2.41s ===============================
2025-05-07T20:30:59.6787189Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py
2025-05-07T20:30:59.6807891Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds
2025-05-07T20:30:59.6828929Z ################################################################################
2025-05-07T20:30:59.6844807Z # [2025-05-07T20:30:59.684Z] Run Python Test Suite:
2025-05-07T20:30:59.6845301Z # ./comm/multi_gpu_car_test.py
2025-05-07T20:30:59.6845713Z ################################################################################
2025-05-07T20:30:59.6869583Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py
2025-05-07T20:31:01.8244466Z ============================= test session starts ==============================
2025-05-07T20:31:01.8246091Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:01.8247159Z cachedir: .pytest_cache
2025-05-07T20:31:01.8248368Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:01.8249846Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:01.8250675Z plugins: hypothesis-6.131.14
2025-05-07T20:31:03.4167253Z collecting ...
collected 5 items
2025-05-07T20:31:03.4178376Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED
2025-05-07T20:31:03.4186661Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED
2025-05-07T20:31:03.4194157Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED
2025-05-07T20:31:03.4201706Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED
2025-05-07T20:31:03.4217098Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED
2025-05-07T20:31:03.4218205Z =========================== short test summary info ============================
2025-05-07T20:31:03.4218914Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:03.4219870Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:03.4220815Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:03.4221756Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:03.4222692Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:03.4223361Z ============================== 5 skipped in 1.72s ==============================
2025-05-07T20:31:03.9808788Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py
2025-05-07T20:31:03.9828659Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 4 seconds
2025-05-07T20:31:03.9850793Z ################################################################################
2025-05-07T20:31:03.9867002Z # [2025-05-07T20:31:03.986Z] Run Python Test Suite:
2025-05-07T20:31:03.9867508Z # ./gather_scatter/gather_scatter_test.py
2025-05-07T20:31:03.9867957Z ################################################################################
2025-05-07T20:31:03.9891595Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py
2025-05-07T20:31:06.1351974Z ============================= test session starts ==============================
2025-05-07T20:31:06.1353663Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:06.1355288Z cachedir: .pytest_cache
2025-05-07T20:31:06.1356351Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:06.1357100Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:06.1357524Z plugins: hypothesis-6.131.14
2025-05-07T20:31:07.7911141Z collecting ...
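All five CAR tests above were skipped because this single-GPU runner cannot satisfy their two-GPU requirement. The skip messages imply a gate along these lines (a sketch; the helper name and exact wiring are assumed, not FBGEMM's actual code):

import unittest

import torch

def _has_enough_gpus(required: int = 2) -> bool:
    # The CAR collectives need peer devices to exchange data with,
    # hence the two-GPU minimum quoted in the skip message.
    return torch.cuda.is_available() and torch.cuda.device_count() >= required

@unittest.skipIf(
    not _has_enough_gpus(),
    "Skip when CUDA is not available or when there are not enough GPUs; "
    "these tests require at least two GPUs",
)
class LLamaMultiGpuTestsSketch(unittest.TestCase):
    def test_allreduce(self) -> None:
        ...  # the real test spawns one rank per GPU and validates the reduction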
collected 2 items
2025-05-07T20:31:07.7922884Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED
2025-05-07T20:31:07.7937318Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED
2025-05-07T20:31:07.7938242Z =========================== short test summary info ============================
2025-05-07T20:31:07.7938899Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU.
2025-05-07T20:31:07.7939782Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU.
2025-05-07T20:31:07.7940394Z ============================== 2 skipped in 1.78s ==============================
2025-05-07T20:31:08.3639717Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py
2025-05-07T20:31:08.3661020Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 5 seconds
2025-05-07T20:31:08.3683127Z ################################################################################
2025-05-07T20:31:08.3698936Z # [2025-05-07T20:31:08.369Z] Run Python Test Suite:
2025-05-07T20:31:08.3699517Z # ./kv_cache/kv_cache_test.py
2025-05-07T20:31:08.3699818Z ################################################################################
2025-05-07T20:31:08.3723931Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py
2025-05-07T20:31:10.5099718Z ============================= test session starts ==============================
2025-05-07T20:31:10.5100526Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:10.5101068Z cachedir: .pytest_cache
2025-05-07T20:31:10.5101665Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:10.5102406Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:10.5102827Z plugins: hypothesis-6.131.14
2025-05-07T20:31:12.0900145Z collecting ... collected 4 items
2025-05-07T20:31:14.9201325Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...)
2025-05-07T20:31:14.9334395Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED
2025-05-07T20:31:14.9491614Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED
2025-05-07T20:31:14.9625083Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED
2025-05-07T20:31:14.9625635Z =========================== short test summary info ============================
2025-05-07T20:31:14.9626363Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/unittest/case.py:117: Skip when H100 is not available or MI300 is not available
2025-05-07T20:31:14.9627305Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/unittest/case.py:117: Skip when xformers is not available
2025-05-07T20:31:14.9628003Z ============================== 4 skipped in 4.58s ==============================
2025-05-07T20:31:16.8560944Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py
2025-05-07T20:31:16.8582667Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 8 seconds
2025-05-07T20:31:16.8603162Z ################################################################################
2025-05-07T20:31:16.8618318Z # [2025-05-07T20:31:16.861Z] Run Python Test Suite:
2025-05-07T20:31:16.8618976Z # ./moe/activation_test.py
2025-05-07T20:31:16.8619307Z ################################################################################
2025-05-07T20:31:16.8645786Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py
2025-05-07T20:31:19.0025716Z ============================= test session starts ==============================
2025-05-07T20:31:19.0026377Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:19.0026922Z cachedir: .pytest_cache
2025-05-07T20:31:19.0027512Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:19.0028255Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:19.0028669Z plugins: hypothesis-6.131.14
2025-05-07T20:31:20.6468687Z TMA benchmarks will be running with experimental grid constant TMA descriptor.
2025-05-07T20:31:20.8247767Z collecting ...
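The gather_scatter and kv_cache skips above are architecture gates rather than failures: those suites want Hopper-class (H100) or MI300 hardware, while this g5.4xlarge runner carries an NVIDIA A10G. A rough sketch of such a gate is below (illustrative; FBGEMM's real helpers may differ):

import unittest

import torch

def _is_hopper() -> bool:
    # H100 (Hopper) reports CUDA compute capability (9, 0); the A10G on
    # this runner reports (8, 6), so the check fails here and tests skip.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() == (9, 0)

@unittest.skipIf(
    not _is_hopper(),
    "Skip when no Hopper GPU is available. This test is only for Hopper GPU.",
)
class GatherScatterTestsSketch(unittest.TestCase):
    def test_gather_along_first_dim(self) -> None:
        ...  # the real test exercises the Hopper-only gather kernel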
collected 2 items

2025-05-07T20:31:26.2701523Z moe/activation_test.py::ActivationTests::test_silu_mul
Trying example: test_silu_mul(T=1, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=4096, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=4096, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=4096, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=1, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=1, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=4096, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=1, D=7168, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=2048, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=2048, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=2048, D=7168, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=128, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=16384, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=16384, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=128, D=7168, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=1, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=1, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=4096, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=1, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=4096, D=7168, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=4096, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=128, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=1, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=2048, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=2048, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=16384, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=16384, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=4096, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=2048, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=2048, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=16384, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=2048, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=16384, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=16384, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=16384, D=7168, contiguous=True, compiled=False)
2025-05-07T20:31:26.2784608Z PASSED
2025-05-07T20:31:26.3366993Z W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last):
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     generator.visit(fn.parse())
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ret = super().visit(node)
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     return visitor(node)
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ast.NodeVisitor.generic_visit(self, node)
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     self.visit(item)
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0:
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant(
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[the same warning and traceback were logged three more times, at 20:31:26.353000, 20:31:26.394000, and 20:31:26.399000; identical output elided]
2025-05-07T20:31:26.8484592Z moe/activation_test.py::ActivationTests::test_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)

self = <...>
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:26.8519095Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:26.8519499Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:26.8519733Z 2025-05-07T20:31:26.8519946Z self = 2025-05-07T20:31:26.8521050Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:26.8522471Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc53d8f4af0>} 2025-05-07T20:31:26.8523850Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:26.8524901Z context = 2025-05-07T20:31:26.8525194Z 2025-05-07T20:31:26.8525372Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:26.8525906Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:26.8526380Z module_map=module_map) 2025-05-07T20:31:26.8526759Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:26.8527128Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:26.8527403Z E ^ 2025-05-07T20:31:26.8527881Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:26.8528458Z 2025-05-07T20:31:26.8528893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:26.8529416Z 2025-05-07T20:31:26.8529536Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:26.8529958Z self=, 2025-05-07T20:31:26.8530370Z T=2048, 2025-05-07T20:31:26.8530564Z D=5120, 2025-05-07T20:31:26.8530761Z scale_ub=1200.0, 2025-05-07T20:31:26.8531002Z contiguous=True, 2025-05-07T20:31:26.8531274Z compiled=False, 2025-05-07T20:31:26.8531481Z ) 2025-05-07T20:31:27.3899048Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:27.3900158Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:31:27.3901599Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:27.3903063Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:27.3904483Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:27.3906075Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:27.3907430Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:27.3909038Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:27.3910601Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:27.3911880Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:27.3913123Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:27.3914360Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:27.3915411Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:27.3916447Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:27.3917691Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:27.3918993Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:27.3920292Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:27.3921398Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:31:27.3922599Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:27.3923983Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:27.3925060Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:27.3925995Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:27.3926742Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:31:27.3927777Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
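The CompilationError above is an environment limitation rather than a test bug: Triton's fp8e4nv corresponds to torch.float8_e4m3fn, which NVIDIA GPUs support natively only from compute capability 8.9 (Ada) and 9.0 (Hopper) onward, while the A10G on a linux.g5.4xlarge runner is compute capability 8.6, where Triton offers only fp8e4b15 and fp8e5. Below is a minimal sketch, assuming unittest-style tests and a simple capability cutoff, of how a suite could skip these kernels on unsupported GPUs; the helper name, class name, and skip message are illustrative, not FBGEMM's actual gating:

import unittest

import torch


def has_native_fp8e4nv() -> bool:
    # fp8e4nv (e4m3) needs native FP8 hardware support, which NVIDIA
    # ships starting with compute capability 8.9 (Ada) / 9.0 (Hopper).
    # The A10G (sm_86) driving this job fails this check.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(has_native_fp8e4nv(), "FP8 e4m3 requires compute capability >= 8.9")
class Fp8GuardedTests(unittest.TestCase):  # hypothetical test class
    ...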
2025-05-07T20:31:28.7810838Z self = 2025-05-07T20:31:28.7811583Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:28.7811921Z 2025-05-07T20:31:28.7812006Z @given( 2025-05-07T20:31:28.7812277Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:28.7820669Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:28.7821207Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:28.7821738Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:28.7822219Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:28.7822629Z ) 2025-05-07T20:31:28.7823133Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:28.7823763Z def test_silu_mul_quant( 2025-05-07T20:31:28.7824115Z self, 2025-05-07T20:31:28.7824403Z T: int, 2025-05-07T20:31:28.7824681Z D: int, 2025-05-07T20:31:28.7824933Z scale_ub: Optional[float], 2025-05-07T20:31:28.7825212Z contiguous: bool, 2025-05-07T20:31:28.7825453Z compiled: bool, 2025-05-07T20:31:28.7825689Z ) -> None: 2025-05-07T20:31:28.7825914Z torch.manual_seed(2025) 2025-05-07T20:31:28.7826153Z 2025-05-07T20:31:28.7826434Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:28.7826971Z 2025-05-07T20:31:28.7827170Z x_sign = torch.sign(x) 2025-05-07T20:31:28.7827471Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:28.7827793Z x = x_sign * x_clamp 2025-05-07T20:31:28.7828038Z x0 = x[:, :D] 2025-05-07T20:31:28.7828252Z x1 = x[:, D:] 2025-05-07T20:31:28.7828466Z 2025-05-07T20:31:28.7828661Z if contiguous: 2025-05-07T20:31:28.7828890Z x0 = x0.contiguous() 2025-05-07T20:31:28.7829158Z x1 = x1.contiguous() 2025-05-07T20:31:28.7829406Z 2025-05-07T20:31:28.7829596Z if scale_ub is not None: 2025-05-07T20:31:28.7829878Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:28.7830228Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:28.7830538Z ) 2025-05-07T20:31:28.7830737Z else: 2025-05-07T20:31:28.7830954Z scale_ub_tensor = None
2025-05-07T20:31:28.7831210Z 2025-05-07T20:31:28.7831458Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:28.7831783Z op = silu_mul_quant 2025-05-07T20:31:28.7832031Z if compiled: 2025-05-07T20:31:28.7832291Z op = torch.compile(op) 2025-05-07T20:31:28.7832595Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:28.7832883Z 2025-05-07T20:31:28.7833078Z > y_fp8, y_scale = fn() 2025-05-07T20:31:28.7833252Z 2025-05-07T20:31:28.7833355Z moe/activation_test.py:117: 2025-05-07T20:31:28.7833663Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:28.7833996Z moe/activation_test.py:115: in fn 2025-05-07T20:31:28.7834290Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:28.7835004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:28.7835708Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:28.7836264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:28.7836964Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:28.7837779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:28.7838315Z kernel = self.compile( 2025-05-07T20:31:28.7838876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:28.7839547Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:28.7839949Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:28.7840177Z 2025-05-07T20:31:28.7840388Z self = 2025-05-07T20:31:28.7841560Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:28.7842972Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc53d946ef0>} 2025-05-07T20:31:28.7844356Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:28.7845399Z context = 2025-05-07T20:31:28.7845697Z 2025-05-07T20:31:28.7845867Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:28.7846399Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:28.7846879Z module_map=module_map) 2025-05-07T20:31:28.7847330Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:28.7847693Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:28.7847957Z E ^ 2025-05-07T20:31:28.7848431Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:28.7848897Z 2025-05-07T20:31:28.7849322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:28.7849848Z 2025-05-07T20:31:28.7849953Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:28.7850377Z self=, 2025-05-07T20:31:28.7850778Z T=2048, 2025-05-07T20:31:28.7850972Z D=5120, 2025-05-07T20:31:28.7851221Z scale_ub=1200.0, 2025-05-07T20:31:28.7851460Z contiguous=True, 2025-05-07T20:31:28.7851686Z compiled=True, 2025-05-07T20:31:28.7851898Z ) 2025-05-07T20:31:28.7852227Z self = 2025-05-07T20:31:28.7852732Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:28.7853016Z 2025-05-07T20:31:28.7867768Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:28.7867968Z 2025-05-07T20:31:28.7868077Z moe/activation_test.py:126: 2025-05-07T20:31:28.7868368Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:28.7868705Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:28.7869040Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:28.7869837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:28.7870605Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:28.7871173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:28.7871919Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:28.7872622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:28.7873357Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:28.7874124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:28.7874883Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:28.7875620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:28.7876272Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:28.7876889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:28.7877412Z fn() 2025-05-07T20:31:28.7877934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:28.7878620Z self.fn.run( 2025-05-07T20:31:28.7879099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:28.7879634Z kernel = self.compile( 2025-05-07T20:31:28.7880187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:28.7880853Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:28.7881247Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:28.7881506Z 2025-05-07T20:31:28.7881742Z self = 2025-05-07T20:31:28.7882849Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:28.7884254Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fc53d9b4790>} 2025-05-07T20:31:28.7885619Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:28.7886656Z context = 2025-05-07T20:31:28.7886952Z 2025-05-07T20:31:28.7887120Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:28.7887731Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:28.7888212Z module_map=module_map) 2025-05-07T20:31:28.7888578Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:28.7888956Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:28.7889228Z E ^ 2025-05-07T20:31:28.7889695Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:28.7890158Z 2025-05-07T20:31:28.7890582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:28.7891106Z 2025-05-07T20:31:28.7891214Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:28.7891635Z self=, 2025-05-07T20:31:28.7892038Z T=16384, 2025-05-07T20:31:28.7892237Z D=7168, 2025-05-07T20:31:28.7892436Z scale_ub=1200.0, 2025-05-07T20:31:28.7892659Z contiguous=False, 2025-05-07T20:31:28.7892896Z compiled=False, 2025-05-07T20:31:28.7893107Z ) 2025-05-07T20:31:29.1591698Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:29.1593217Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:31:29.1594670Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:29.1596119Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:29.1597532Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:29.1599202Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:29.1600528Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:29.1601940Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:29.1603392Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:29.1604667Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 
2025-05-07T20:31:29.1605920Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:29.1607155Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:31:29.1608205Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:29.1609358Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:31:29.1610609Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:29.1611928Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:29.1613068Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:29.1614126Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:31:29.1615334Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:29.1616718Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:29.1617807Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:29.1618816Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:29.1619568Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:29.1620607Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
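The W0507 traceback above is a side effect of the same failure: torch.compile lowers user Triton kernels to TTIR to work out which arguments the kernel writes to, and when that lowering itself raises, identify_mutated_tensors falls back to treating every tensor input as mutated, which is functionally safe but blocks some optimizations. A minimal sketch of the shape of that fallback, not torch's actual implementation:

from typing import Any, Callable, Dict, List

import torch


def mutated_tensor_names(
    analyze: Callable[[], List[str]],
    kwargs: Dict[str, Any],
) -> List[str]:
    # analyze() stands in for the TTIR build plus store analysis that
    # torch._higher_order_ops.triton_kernel_wrap performs. If it raises,
    # as it does here because the kernel cannot compile on this GPU,
    # conservatively report every tensor argument as mutated.
    try:
        return analyze()
    except Exception:
        return [name for name, value in kwargs.items() if isinstance(value, torch.Tensor)]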
2025-05-07T20:31:31.0835720Z self = 2025-05-07T20:31:31.0836350Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:31.0836752Z
2025-05-07T20:31:31.0847034Z 2025-05-07T20:31:31.0847281Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:31.0847606Z op = silu_mul_quant 2025-05-07T20:31:31.0847861Z if compiled: 2025-05-07T20:31:31.0848128Z op = torch.compile(op) 2025-05-07T20:31:31.0848438Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:31.0848716Z 2025-05-07T20:31:31.0848922Z > y_fp8, y_scale = fn() 2025-05-07T20:31:31.0849098Z 2025-05-07T20:31:31.0849205Z moe/activation_test.py:117: 2025-05-07T20:31:31.0849515Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:31.0849850Z moe/activation_test.py:115: in fn 2025-05-07T20:31:31.0850143Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:31.0851015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:31.0851729Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:31.0852297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:31.0853002Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:31.0853685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:31.0854229Z kernel = self.compile( 2025-05-07T20:31:31.0854792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:31.0855472Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:31.0856257Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:31.0856497Z 2025-05-07T20:31:31.0856718Z self = 2025-05-07T20:31:31.0857830Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:31.0859363Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc53d28d2d0>} 2025-05-07T20:31:31.0860737Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:31.0861788Z context = 2025-05-07T20:31:31.0862081Z 2025-05-07T20:31:31.0862253Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:31.0862790Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:31.0863273Z module_map=module_map) 2025-05-07T20:31:31.0863792Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:31.0864156Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:31.0864424Z E ^ 2025-05-07T20:31:31.0864901Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:31.0865361Z 2025-05-07T20:31:31.0865790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
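For reference, the arithmetic that both failing kernels (_fbgemm_silu_mul_quant and _kernel_quantize_fp8_row) implement, reconstructed from the test's ref_fn, is a SiLU-gated product followed by row-wise FP8 quantization with an optional scale upper bound. A pure-PyTorch sketch, assuming row-wise max-abs scaling and torch.float8_e4m3fn (Triton's fp8e4nv); the clamping details are assumptions, not the exact fbgemm_gpu kernel semantics:

from typing import Optional, Tuple

import torch


def silu_mul_quant_reference(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
    fp8_dtype: torch.dtype = torch.float8_e4m3fn,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU-gated product in fp32, exactly as the test's ref_fn computes it.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    # One scale per row so each row fills the representable FP8 range;
    # applying scale_ub as a cap on the row max is an assumption.
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    fp8_max = torch.finfo(fp8_dtype).max
    scale = row_max / fp8_max
    y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(fp8_dtype)
    # Dequantize as y_fp8.float() * scale, matching y_scale[:, None] in the test.
    return y_fp8, scale.squeeze(-1)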
2025-05-07T20:31:31.0866323Z 2025-05-07T20:31:31.0866431Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:31.0866858Z self=, 2025-05-07T20:31:31.0867265Z T=1, 2025-05-07T20:31:31.0867459Z D=7168, 2025-05-07T20:31:31.0867662Z scale_ub=None, 2025-05-07T20:31:31.0867890Z contiguous=True, 2025-05-07T20:31:31.0868117Z compiled=True, 2025-05-07T20:31:31.0868330Z ) 2025-05-07T20:31:31.0868666Z self = 2025-05-07T20:31:31.0869155Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:31.0869440Z 2025-05-07T20:31:31.0891873Z > y_fp8_ref,
2025-05-07T20:31:31.0866323Z 2025-05-07T20:31:31.0866431Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:31.0866858Z self=, 2025-05-07T20:31:31.0867265Z T=1, 2025-05-07T20:31:31.0867459Z D=7168, 2025-05-07T20:31:31.0867662Z scale_ub=None, 2025-05-07T20:31:31.0867890Z contiguous=True, 2025-05-07T20:31:31.0868117Z compiled=True, 2025-05-07T20:31:31.0868330Z ) 2025-05-07T20:31:31.0868666Z self = 2025-05-07T20:31:31.0869155Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:31.0869440Z 2025-05-07T20:31:31.0869519Z @given( 2025-05-07T20:31:31.0869762Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:31.0870079Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:31.0878150Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:31.0878522Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:31.0878885Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:31.0879184Z ) 2025-05-07T20:31:31.0879546Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:31.0880148Z def test_silu_mul_quant( 2025-05-07T20:31:31.0880404Z self, 2025-05-07T20:31:31.0880607Z T: int, 2025-05-07T20:31:31.0880814Z D: int, 2025-05-07T20:31:31.0881036Z scale_ub: Optional[float], 2025-05-07T20:31:31.0881319Z contiguous: bool, 2025-05-07T20:31:31.0881566Z compiled: bool, 2025-05-07T20:31:31.0881790Z ) -> None: 2025-05-07T20:31:31.0882015Z torch.manual_seed(2025) 2025-05-07T20:31:31.0882267Z 2025-05-07T20:31:31.0882548Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:31.0882898Z 2025-05-07T20:31:31.0883103Z x_sign = torch.sign(x) 2025-05-07T20:31:31.0883403Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:31.0883721Z x = x_sign * x_clamp 2025-05-07T20:31:31.0883969Z x0 = x[:, :D] 2025-05-07T20:31:31.0884187Z x1 = x[:, D:] 2025-05-07T20:31:31.0884403Z 2025-05-07T20:31:31.0884597Z if contiguous: 2025-05-07T20:31:31.0884840Z x0 = x0.contiguous() 2025-05-07T20:31:31.0885108Z x1 = x1.contiguous() 2025-05-07T20:31:31.0885358Z 2025-05-07T20:31:31.0885553Z if scale_ub is not None: 2025-05-07T20:31:31.0885840Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:31.0886187Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:31.0886509Z ) 2025-05-07T20:31:31.0886706Z else: 2025-05-07T20:31:31.0886929Z scale_ub_tensor = None 2025-05-07T20:31:31.0887194Z 2025-05-07T20:31:31.0887431Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:31.0887759Z op = silu_mul_quant 2025-05-07T20:31:31.0888018Z if compiled: 2025-05-07T20:31:31.0888268Z op = torch.compile(op) 2025-05-07T20:31:31.0888577Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:31.0888860Z 2025-05-07T20:31:31.0889054Z y_fp8, y_scale = fn() 2025-05-07T20:31:31.0889354Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:31.0889654Z 2025-05-07T20:31:31.0889897Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:31.0890343Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:31.0890646Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:31.0890972Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:31.0891339Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:31.0891660Z 2025-05-07T20:31:31.0891873Z > y_fp8_ref,
y_scale_ref = ref_fn() 2025-05-07T20:31:31.0892074Z 2025-05-07T20:31:31.0892178Z moe/activation_test.py:126: 2025-05-07T20:31:31.0892484Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:31.0892829Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:31.0893162Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:31.0893972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:31.0894740Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:31.0895306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:31.0895994Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:31.0896698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:31.0897432Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:31.0898283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:31.0899038Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:31.0899863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:31.0900521Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:31.0901143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:31.0901666Z fn() 2025-05-07T20:31:31.0902234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:31.0902831Z self.fn.run( 2025-05-07T20:31:31.0903307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:31.0903849Z kernel = self.compile( 2025-05-07T20:31:31.0904401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:31.0905069Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:31.0905475Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:31.0905710Z 2025-05-07T20:31:31.0905923Z self = 2025-05-07T20:31:31.0907036Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:31.0908437Z codegen_fns = {'convert_custom_types': , 'min_dot_size': .
at 0x7fc53c0fd5a0>} 2025-05-07T20:31:31.0909799Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:31.0910848Z context = 2025-05-07T20:31:31.0911150Z 2025-05-07T20:31:31.0911323Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:31.0911854Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:31.0912463Z module_map=module_map) 2025-05-07T20:31:31.0912841Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:31.0913208Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:31.0913491Z E ^ 2025-05-07T20:31:31.0913965Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:31.0914430Z 2025-05-07T20:31:31.0914852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:31.0915376Z 2025-05-07T20:31:31.0915491Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:31.0915919Z self=, 2025-05-07T20:31:31.0916326Z T=4096, 2025-05-07T20:31:31.0916517Z D=5120, 2025-05-07T20:31:31.0916712Z scale_ub=None, 2025-05-07T20:31:31.0916932Z contiguous=False, 2025-05-07T20:31:31.0917163Z compiled=False, 2025-05-07T20:31:31.0917371Z ) 2025-05-07T20:31:31.6426617Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:31.6427721Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:31:31.6429237Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:31.6430900Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:31.6432333Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:31.6433762Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:31.6435114Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:31.6436535Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:31.6438002Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:31.6439402Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 
2025-05-07T20:31:31.6440838Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:31.6442081Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:31:31.6443151Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:31.6444198Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:31:31.6445655Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:31.6446974Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:31.6448114Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:31.6449180Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:31:31.6450386Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:31.6451783Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:31.6452862Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:31.6453793Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:31.6454550Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:31:31.6455972Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
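The W0507 block above is torch.compile noise rather than the test failure itself: Dynamo's `identify_mutated_tensors` pass re-lowers the user-defined Triton kernel to TTIR to work out which arguments it writes to; when that lowering hits the same fp8e4nv error, it conservatively "assumes every input is mutated" and keeps going, so the error surfaces first as a warning (once per compile attempt) and then again as the real `CompilationError` at launch. A sketch of gating compilation on hardware support (the helper is our assumption, not an FBGEMM API), which keeps Dynamo from tracing the kernel at all on unsupported parts:

```python
import torch

def maybe_compile(op):
    # Only hand the op to torch.compile where Triton can lower fp8e4nv
    # (SM 8.9+); on older parts such as this runner's A10G (SM 8.6),
    # stay eager so the identify_mutated_tensors analysis never runs.
    # Note: the kernel itself still needs an FP8-capable device; this
    # only avoids the compile-time warnings, not the launch failure.
    if torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9):
        return torch.compile(op)
    return op
```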
2025-05-07T20:31:36.2005473Z self = 2025-05-07T20:31:36.2007088Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:36.2007773Z 2025-05-07T20:31:36.2007950Z @given( 2025-05-07T20:31:36.2008440Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:36.2009095Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:36.2009728Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:36.2010409Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:36.2011081Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:36.2011669Z ) 2025-05-07T20:31:36.2012397Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:36.2013300Z def test_silu_mul_quant( 2025-05-07T20:31:36.2013612Z self, 2025-05-07T20:31:36.2013841Z T: int, 2025-05-07T20:31:36.2014384Z D: int, 2025-05-07T20:31:36.2014622Z scale_ub: Optional[float], 2025-05-07T20:31:36.2014908Z contiguous: bool, 2025-05-07T20:31:36.2015159Z compiled: bool, 2025-05-07T20:31:36.2015403Z ) -> None: 2025-05-07T20:31:36.2015633Z torch.manual_seed(2025) 2025-05-07T20:31:36.2015882Z 2025-05-07T20:31:36.2016172Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:36.2016527Z 2025-05-07T20:31:36.2016730Z x_sign = torch.sign(x) 2025-05-07T20:31:36.2017028Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:36.2017351Z x = x_sign * x_clamp 2025-05-07T20:31:36.2017603Z x0 = x[:, :D] 2025-05-07T20:31:36.2017825Z x1 = x[:, D:] 2025-05-07T20:31:36.2018116Z 2025-05-07T20:31:36.2018316Z if contiguous: 2025-05-07T20:31:36.2018556Z x0 = x0.contiguous() 2025-05-07T20:31:36.2018826Z x1 = x1.contiguous() 2025-05-07T20:31:36.2019085Z 2025-05-07T20:31:36.2019288Z if scale_ub is not None: 2025-05-07T20:31:36.2019574Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:36.2019932Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:36.2020245Z ) 2025-05-07T20:31:36.2020448Z else: 2025-05-07T20:31:36.2020668Z scale_ub_tensor = None
2025-05-07T20:31:36.2020926Z 2025-05-07T20:31:36.2021172Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.2021498Z op = silu_mul_quant 2025-05-07T20:31:36.2021760Z if compiled: 2025-05-07T20:31:36.2022014Z op = torch.compile(op) 2025-05-07T20:31:36.2022324Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.2022612Z 2025-05-07T20:31:36.2022812Z > y_fp8, y_scale = fn() 2025-05-07T20:31:36.2022988Z 2025-05-07T20:31:36.2023100Z moe/activation_test.py:117: 2025-05-07T20:31:36.2023413Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.2023749Z moe/activation_test.py:115: in fn 2025-05-07T20:31:36.2024044Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.2024932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:36.2025660Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:36.2026220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:36.2026929Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:36.2027619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:36.2028167Z kernel = self.compile( 2025-05-07T20:31:36.2028730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:36.2029418Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:36.2029831Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.2030069Z 2025-05-07T20:31:36.2030285Z self = 2025-05-07T20:31:36.2031406Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:36.2032844Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc53f59e7a0>} 2025-05-07T20:31:36.2034276Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:36.2035417Z context = 2025-05-07T20:31:36.2035714Z 2025-05-07T20:31:36.2035885Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:36.2036431Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:36.2036920Z module_map=module_map) 2025-05-07T20:31:36.2037291Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:36.2037655Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:36.2037922Z E ^ 2025-05-07T20:31:36.2038395Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:36.2038864Z 2025-05-07T20:31:36.2039292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:36.2039822Z 2025-05-07T20:31:36.2039937Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:36.2040363Z self=, 2025-05-07T20:31:36.2040767Z T=4096, 2025-05-07T20:31:36.2040974Z D=7168, 2025-05-07T20:31:36.2041172Z scale_ub=None, 2025-05-07T20:31:36.2041388Z contiguous=False, 2025-05-07T20:31:36.2041621Z compiled=False, 2025-05-07T20:31:36.2041838Z ) 2025-05-07T20:31:36.2042166Z self = 2025-05-07T20:31:36.2042670Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:36.2042953Z 2025-05-07T20:31:36.2043030Z @given( 2025-05-07T20:31:36.2043277Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:36.2043593Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:36.2043909Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:36.2044250Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:36.2044593Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:36.2044886Z ) 2025-05-07T20:31:36.2045251Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:36.2045793Z def test_silu_mul_quant( 2025-05-07T20:31:36.2046036Z self, 2025-05-07T20:31:36.2046236Z T: int, 2025-05-07T20:31:36.2046441Z D: int, 2025-05-07T20:31:36.2046662Z scale_ub: Optional[float], 2025-05-07T20:31:36.2046941Z contiguous: bool, 2025-05-07T20:31:36.2047187Z compiled: bool, 2025-05-07T20:31:36.2047412Z ) -> None: 2025-05-07T20:31:36.2047636Z torch.manual_seed(2025) 2025-05-07T20:31:36.2047887Z 2025-05-07T20:31:36.2048162Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:36.2048517Z 2025-05-07T20:31:36.2048719Z x_sign = torch.sign(x) 2025-05-07T20:31:36.2049015Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:36.2049340Z x = x_sign * x_clamp 2025-05-07T20:31:36.2049588Z x0 = x[:, :D] 2025-05-07T20:31:36.2049807Z x1 = x[:, D:] 2025-05-07T20:31:36.2050022Z 2025-05-07T20:31:36.2050221Z if contiguous: 2025-05-07T20:31:36.2050454Z x0 = x0.contiguous() 2025-05-07T20:31:36.2050724Z x1 = x1.contiguous() 2025-05-07T20:31:36.2050974Z 2025-05-07T20:31:36.2051174Z if scale_ub is not None: 2025-05-07T20:31:36.2051452Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:36.2051799Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:36.2052114Z ) 2025-05-07T20:31:36.2052305Z else: 2025-05-07T20:31:36.2052520Z scale_ub_tensor = None 2025-05-07T20:31:36.2052782Z 2025-05-07T20:31:36.2053016Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.2053337Z op = silu_mul_quant 2025-05-07T20:31:36.2053594Z if compiled: 2025-05-07T20:31:36.2053929Z op = torch.compile(op) 2025-05-07T20:31:36.2054238Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.2054519Z 2025-05-07T20:31:36.2054712Z > y_fp8, y_scale = fn() 2025-05-07T20:31:36.2054891Z 2025-05-07T20:31:36.2054994Z moe/activation_test.py:117: 2025-05-07T20:31:36.2055296Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.2055828Z moe/activation_test.py:115: in fn 2025-05-07T20:31:36.2056116Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.2056827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:36.2057546Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:36.2058157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:36.2058858Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:36.2059548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:36.2060095Z kernel = self.compile( 2025-05-07T20:31:36.2060651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:36.2061330Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:36.2061737Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.2061968Z 2025-05-07T20:31:36.2062187Z self = 2025-05-07T20:31:36.2063294Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:36.2064714Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc53d946cb0>} 2025-05-07T20:31:36.2066104Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:36.2067291Z context = 2025-05-07T20:31:36.2067586Z 2025-05-07T20:31:36.2067757Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:36.2068300Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:36.2068783Z module_map=module_map) 2025-05-07T20:31:36.2069164Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:36.2069525Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:36.2069794Z E ^ 2025-05-07T20:31:36.2070283Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:36.2070746Z 2025-05-07T20:31:36.2071176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:36.2071715Z 2025-05-07T20:31:36.2071823Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:36.2072254Z self=, 2025-05-07T20:31:36.2072671Z T=128, 2025-05-07T20:31:36.2072863Z D=7168, 2025-05-07T20:31:36.2073071Z scale_ub=None, 2025-05-07T20:31:36.2073324Z contiguous=False, 2025-05-07T20:31:36.2073582Z compiled=True, 2025-05-07T20:31:36.2073802Z ) 2025-05-07T20:31:36.2741463Z self = 2025-05-07T20:31:36.2742217Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:36.2742636Z 2025-05-07T20:31:36.2742849Z @given( 2025-05-07T20:31:36.2743826Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:36.2756021Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:36.2756398Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:36.2756748Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:36.2757090Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:36.2757379Z ) 2025-05-07T20:31:36.2757743Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:36.2758196Z def test_silu_mul_quant( 2025-05-07T20:31:36.2758440Z self, 2025-05-07T20:31:36.2758642Z T: int, 2025-05-07T20:31:36.2758843Z D: int, 2025-05-07T20:31:36.2759061Z scale_ub: Optional[float], 2025-05-07T20:31:36.2759342Z contiguous: bool, 2025-05-07T20:31:36.2759592Z compiled: bool, 2025-05-07T20:31:36.2759818Z ) -> None: 2025-05-07T20:31:36.2760045Z torch.manual_seed(2025) 2025-05-07T20:31:36.2760309Z 2025-05-07T20:31:36.2760585Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:36.2760936Z 2025-05-07T20:31:36.2761145Z x_sign = torch.sign(x) 2025-05-07T20:31:36.2761443Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:36.2761751Z x = x_sign * x_clamp 2025-05-07T20:31:36.2761998Z x0 = x[:, :D] 2025-05-07T20:31:36.2762217Z x1 = x[:, D:] 2025-05-07T20:31:36.2762425Z 2025-05-07T20:31:36.2762618Z if contiguous: 2025-05-07T20:31:36.2762852Z x0 = x0.contiguous() 2025-05-07T20:31:36.2763108Z x1 = x1.contiguous() 2025-05-07T20:31:36.2763376Z 2025-05-07T20:31:36.2763597Z if scale_ub is not None: 2025-05-07T20:31:36.2763868Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:36.2764212Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:36.2764528Z ) 2025-05-07T20:31:36.2764717Z else: 2025-05-07T20:31:36.2764940Z scale_ub_tensor = None 2025-05-07T20:31:36.2765203Z 2025-05-07T20:31:36.2765437Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.2765939Z op = silu_mul_quant 2025-05-07T20:31:36.2766196Z if compiled: 2025-05-07T20:31:36.2766451Z op = torch.compile(op) 2025-05-07T20:31:36.2766751Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.2767032Z 2025-05-07T20:31:36.2767232Z y_fp8, y_scale = fn() 2025-05-07T20:31:36.2767520Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:36.2767817Z 2025-05-07T20:31:36.2768062Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.2768395Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:36.2768696Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:36.2769019Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:36.2769385Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:36.2769702Z 2025-05-07T20:31:36.2769914Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:36.2770114Z 2025-05-07T20:31:36.2770230Z moe/activation_test.py:126: 2025-05-07T20:31:36.2770526Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.2770867Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:36.2771206Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:36.2772003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:36.2772767Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:36.2773328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:36.2774015Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:36.2774841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:36.2775579Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:36.2776348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:36.2777103Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:36.2777842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:36.2778574Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:36.2779187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:36.2779708Z fn() 2025-05-07T20:31:36.2780231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:36.2780821Z self.fn.run( 2025-05-07T20:31:36.2781304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:36.2781859Z kernel = self.compile( 2025-05-07T20:31:36.2782417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:36.2783090Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:36.2783516Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.2783778Z 2025-05-07T20:31:36.2783990Z self = 2025-05-07T20:31:36.2785101Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:36.2786518Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fc53e480d30>} 2025-05-07T20:31:36.2788252Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:36.2789308Z context = 2025-05-07T20:31:36.2789608Z 2025-05-07T20:31:36.2789777Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:36.2790310Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:36.2790780Z module_map=module_map) 2025-05-07T20:31:36.2791157Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:36.2791530Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:36.2791804Z E ^ 2025-05-07T20:31:36.2792272Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:36.2792743Z 2025-05-07T20:31:36.2793254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:36.2793829Z 2025-05-07T20:31:36.2793936Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:36.2794360Z self=, 2025-05-07T20:31:36.2794762Z T=128, 2025-05-07T20:31:36.2794956Z D=7168, 2025-05-07T20:31:36.2795151Z scale_ub=None, 2025-05-07T20:31:36.2795367Z contiguous=False, 2025-05-07T20:31:36.2795600Z compiled=False, 2025-05-07T20:31:36.2795810Z ) 2025-05-07T20:31:36.4889791Z self = 2025-05-07T20:31:36.4890506Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:36.4890817Z 2025-05-07T20:31:36.4890901Z @given( 2025-05-07T20:31:36.4891143Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:36.4891475Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:36.4891784Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:36.4892125Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:36.4892467Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:36.4892756Z ) 2025-05-07T20:31:36.4893123Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:36.4893611Z def test_silu_mul_quant( 2025-05-07T20:31:36.4893882Z self, 2025-05-07T20:31:36.4894084Z T: int, 2025-05-07T20:31:36.4894289Z D: int, 2025-05-07T20:31:36.4894510Z scale_ub: Optional[float], 2025-05-07T20:31:36.4894794Z contiguous: bool, 2025-05-07T20:31:36.4895042Z compiled: bool, 2025-05-07T20:31:36.4895274Z ) -> None: 2025-05-07T20:31:36.4895500Z torch.manual_seed(2025) 2025-05-07T20:31:36.4895756Z 2025-05-07T20:31:36.4896038Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:36.4896388Z 2025-05-07T20:31:36.4896589Z x_sign = torch.sign(x) 2025-05-07T20:31:36.4896890Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:36.4897202Z x = x_sign * x_clamp 2025-05-07T20:31:36.4897448Z x0 = x[:, :D] 2025-05-07T20:31:36.4897673Z x1 = x[:, D:] 2025-05-07T20:31:36.4897886Z 2025-05-07T20:31:36.4898158Z if contiguous: 2025-05-07T20:31:36.4898400Z x0 = x0.contiguous() 2025-05-07T20:31:36.4898661Z x1 = x1.contiguous() 2025-05-07T20:31:36.4898911Z 2025-05-07T20:31:36.4899119Z if scale_ub is not None: 2025-05-07T20:31:36.4899395Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:36.4899748Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:36.4900062Z ) 2025-05-07T20:31:36.4900256Z else: 2025-05-07T20:31:36.4900474Z scale_ub_tensor = None 2025-05-07T20:31:36.4900871Z 2025-05-07T20:31:36.4901108Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.4901433Z op = silu_mul_quant 2025-05-07T20:31:36.4901691Z if compiled: 
2025-05-07T20:31:36.4901949Z op = torch.compile(op) 2025-05-07T20:31:36.4902249Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.4902535Z 2025-05-07T20:31:36.4902740Z > y_fp8, y_scale = fn() 2025-05-07T20:31:36.4902911Z 2025-05-07T20:31:36.4903015Z moe/activation_test.py:117: 2025-05-07T20:31:36.4903319Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.4903664Z moe/activation_test.py:115: in fn 2025-05-07T20:31:36.4903950Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.4904665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:36.4905379Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:36.4905938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:36.4906632Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:36.4907315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:36.4907866Z kernel = self.compile( 2025-05-07T20:31:36.4908421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:36.4909100Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:36.4909511Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.4909740Z 2025-05-07T20:31:36.4910047Z self = 2025-05-07T20:31:36.4911151Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:36.4912559Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc5175b0430>} 2025-05-07T20:31:36.4913979Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:36.4915022Z context = 2025-05-07T20:31:36.4915314Z 2025-05-07T20:31:36.4915495Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:36.4916027Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:36.4916512Z module_map=module_map) 2025-05-07T20:31:36.4916895Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:36.4917256Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:36.4917526Z E ^ 2025-05-07T20:31:36.4918005Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:36.4918465Z 2025-05-07T20:31:36.4918896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:36.4919417Z 2025-05-07T20:31:36.4919527Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:36.4919952Z self=, 2025-05-07T20:31:36.4920365Z T=4096, 2025-05-07T20:31:36.4920557Z D=5120, 2025-05-07T20:31:36.4920764Z scale_ub=1200.0, 2025-05-07T20:31:36.4920999Z contiguous=True, 2025-05-07T20:31:36.4921234Z compiled=False, 2025-05-07T20:31:36.4921528Z ) 2025-05-07T20:31:36.4921859Z self = 2025-05-07T20:31:36.4922373Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:36.4922650Z 2025-05-07T20:31:36.4922736Z @given( 2025-05-07T20:31:36.4922972Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:36.4923298Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:36.4923671Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:36.4924034Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:36.4924373Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:36.4924662Z ) 2025-05-07T20:31:36.4925022Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:36.4925483Z def test_silu_mul_quant( 2025-05-07T20:31:36.4925731Z self, 2025-05-07T20:31:36.4925932Z T: int, 2025-05-07T20:31:36.4926142Z D: int, 2025-05-07T20:31:36.4926370Z scale_ub: Optional[float], 2025-05-07T20:31:36.4926654Z contiguous: bool, 2025-05-07T20:31:36.4926905Z compiled: bool, 2025-05-07T20:31:36.4927129Z ) -> None: 2025-05-07T20:31:36.4927357Z torch.manual_seed(2025) 2025-05-07T20:31:36.4927610Z 2025-05-07T20:31:36.4927885Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:36.4928234Z 2025-05-07T20:31:36.4928438Z x_sign = torch.sign(x) 2025-05-07T20:31:36.4928731Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:36.4929049Z x = x_sign * x_clamp 2025-05-07T20:31:36.4929297Z x0 = x[:, :D] 2025-05-07T20:31:36.4929524Z x1 = x[:, D:] 2025-05-07T20:31:36.4929729Z 2025-05-07T20:31:36.4929923Z if contiguous: 2025-05-07T20:31:36.4930273Z x0 = x0.contiguous() 2025-05-07T20:31:36.4930535Z x1 = x1.contiguous() 2025-05-07T20:31:36.4930778Z 2025-05-07T20:31:36.4930982Z if scale_ub is not None: 2025-05-07T20:31:36.4931259Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:36.4931598Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:36.4931907Z ) 2025-05-07T20:31:36.4932103Z else: 2025-05-07T20:31:36.4932319Z scale_ub_tensor = None 2025-05-07T20:31:36.4932576Z 2025-05-07T20:31:36.4932813Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.4933133Z op = silu_mul_quant 2025-05-07T20:31:36.4933387Z if compiled: 2025-05-07T20:31:36.4933634Z op = torch.compile(op) 2025-05-07T20:31:36.4933936Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.4934211Z 2025-05-07T20:31:36.4934405Z > y_fp8, y_scale = fn() 2025-05-07T20:31:36.4934578Z 2025-05-07T20:31:36.4934684Z moe/activation_test.py:117: 2025-05-07T20:31:36.4934980Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.4935316Z moe/activation_test.py:115: in fn 2025-05-07T20:31:36.4935598Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.4936303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:36.4937007Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:36.4937548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:36.4938295Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:36.4938973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:36.4939518Z kernel = self.compile( 2025-05-07T20:31:36.4940072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:36.4940742Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:36.4941233Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.4941460Z 2025-05-07T20:31:36.4941681Z self = 2025-05-07T20:31:36.4942769Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:36.4944211Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc52c335b40>} 2025-05-07T20:31:36.4945586Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:36.4946626Z context = 2025-05-07T20:31:36.4946924Z 2025-05-07T20:31:36.4947092Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:36.4947621Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:36.4948100Z module_map=module_map) 2025-05-07T20:31:36.4948473Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:36.4948835Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:36.4949095Z E ^ 2025-05-07T20:31:36.4949573Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:31:36.4950533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:31:36.4951164Z Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:31:36.9647791Z W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last):
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     generator.visit(fn.parse())
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ret = super().visit(node)
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     return visitor(node)
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ast.NodeVisitor.generic_visit(self, node)
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     self.visit(item)
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0:
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant(
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[the same [1/4] warning and traceback were emitted three more times for this example, at 20:31:37.123, 20:31:37.569, and 20:31:37.599]
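The warning above comes from torch.compile's Triton-kernel wrapper: identify_mutated_tensors tries to lower the kernel to TTIR to work out which arguments are written to, that lowering hits the same fp8e4nv error as the kernel itself, and the wrapper falls back to assuming every input is mutated. The root complaint is hardware support: Triton's fp8e4nv is the type that torch.float8_e4m3fn lowers to, and the error text says this GPU only offers fp8e4b15 and fp8e5 (the linux.g5.4xlarge runner's NVIDIA A10G reports compute capability (8, 6)). A minimal sketch of probing for support before touching the FP8 path; the helper name and the (8, 9) Ada/Hopper cutoff are inferences from the error text, not something this log states:

```python
import torch

def fp8_e4m3_supported() -> bool:
    """Best-effort probe: can this GPU run kernels that use torch.float8_e4m3fn
    (Triton's fp8e4nv)? Assumes the sm_89+ cutoff implied by the error above."""
    if not torch.cuda.is_available():
        return False
    # Ada (sm_89) and Hopper (sm_90) have native E4M3 support; an A10G is sm_86.
    return torch.cuda.get_device_capability() >= (8, 9)
```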
2025-05-07T20:31:37.9108137Z self = 
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc517344820>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
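The failing reference path is the clearest summary of what the test needs: ref_fn computes SiLU(x0) * x1 in fp32 and hands the result to triton_quantize_fp8_row, whose autotuned _kernel_quantize_fp8_row then fails to compile for this architecture. For orientation only, the sketch below is an eager-mode stand-in for the assumed row-wise recipe: take each row's max magnitude, optionally cap it with scale_ub, derive a scale that fits the row into float8_e4m3fn's finite range, and return quantized rows plus per-row dequantization scales. The function name, the 448.0 E4M3 maximum, and the epsilon guard are assumptions, not FBGEMM's actual implementation.

```python
from typing import Optional, Tuple

import torch

E4M3_MAX = 448.0  # largest finite value representable in torch.float8_e4m3fn


def quantize_fp8_row_sketch(
    x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row maximum magnitude decides how much each row must be shrunk.
    row_amax = x.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        # Cap the dynamic range, mirroring the test's optional scale_ub tensor.
        row_amax = torch.clamp(row_amax, max=scale_ub.item())
    row_amax = torch.clamp(row_amax, min=1e-12)  # avoid division by zero
    scale = row_amax / E4M3_MAX  # per-row dequantization factor
    x_fp8 = (x.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
    return x_fp8, scale


# Round trip mirrors how the test dequantizes: y_fp8.to(torch.float32) * y_scale[:, None]
y = torch.randn(4, 16)
y_fp8, y_scale = quantize_fp8_row_sketch(y)
y_round_trip = y_fp8.to(torch.float32) * y_scale[:, None]
```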
2025-05-07T20:31:37.9146930Z Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:31:38.3562518Z W0507 20:31:38.352000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[traceback identical to the [1/4] warning above, ending in the same CompilationError for _fbgemm_silu_mul_quant; emitted four times for this example, at 20:31:38.352, 20:31:38.515, 20:31:38.962, and 20:31:38.992]
2025-05-07T20:31:39.4443956Z self = 
T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True

[test source and intermediate frames identical to the T = 1 failure above; only the object addresses differ]

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
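Every example hypothesis tries (T=1, T=2048, T=128, ...) fails identically, because the failure is a property of the machine rather than of the inputs. One conventional way to keep such a suite green on pre-sm_89 runners, offered here only as a sketch and not as what this workflow does, is to skip FP8-dependent tests when the device lacks the type:

```python
import pytest
import torch


def require_fp8_gpu() -> None:
    """Skip (rather than fail) when Triton's fp8e4nv / torch.float8_e4m3fn is
    unavailable. The (8, 9) threshold is an assumption; see the error above."""
    if not torch.cuda.is_available():
        pytest.skip("CUDA device required")
    if torch.cuda.get_device_capability() < (8, 9):
        pytest.skip("float8_e4m3fn (Triton fp8e4nv) needs compute capability >= 8.9")
```

Called at the top of test_silu_mul_quant, this would turn the repeated failures above into a single skip on this A10G runner.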
2025-05-07T20:31:39.4483350Z Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:31:39.9157020Z W0507 20:31:39.912000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[traceback identical to the [1/4] warning above; emitted at 20:31:39.912 and 20:31:40.077; the final repeat at 20:31:40.527 ends:]
2025-05-07T20:31:40.5346155Z W0507 20:31:40.527000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:40.5605308Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:40.5606405Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:31:40.5607767Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:40.5609216Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:40.5610768Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:40.5612186Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:40.5613516Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:40.5614924Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:40.5616373Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:40.5617646Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:31:40.5618991Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:40.5620228Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:31:40.5621278Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:40.5622327Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:31:40.5623573Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:40.5625010Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:40.5626147Z W0507 
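[All of these warnings reduce to one fact: Triton's fp8e4nv type (torch.float8_e4m3fn) is only implemented for GPUs with compute capability >= 8.9 (Ada/Hopper), while the A10G on this linux.g5.4xlarge runner is sm_86, where only fp8e4b15 and fp8e5 are available. A minimal guard a test could use to skip these cases on older GPUs -- a sketch only, with a helper name that is not part of moe/activation_test.py:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv needs an Ada (sm_89) or Hopper (sm_90) class GPU;
        # the A10G on a g5 instance reports capability (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the test class or method:
    # @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv not supported on this GPU")
]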
2025-05-07T20:31:40.9685311Z self =
T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc514dd11b0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
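[For reference, the failing ref_fn path computes a SwiGLU-style activation, y = x0 * sigmoid(x0) * x1, and then quantizes each row of y to fp8 with one scale per row. Below is a rough eager-mode sketch of that row-wise quantization -- an illustration only, assuming torch.float8_e4m3fn is available (exactly what this runner lacks), and not the actual triton_quantize_fp8_row implementation:

    from typing import Optional, Tuple
    import torch

    def rowwise_quantize_fp8(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        FP8_MAX = 448.0  # max normal value of float8_e4m3fn
        # One scale per row, derived from the row's max magnitude.
        row_max = y.abs().amax(dim=1).to(torch.float32).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / FP8_MAX  # dequantization scale, one per row
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

This matches how the test consumes the result: y_fp8.to(torch.float32) * y_scale[:, None] recovers an approximation of y.]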
Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
[the identify_mutated_tensors warning and CompilationError traceback above were logged four more times during this example, at 20:31:41.442, 20:31:41.605, 20:31:42.053, and 20:31:42.083, now tagged [1/7]]
2025-05-07T20:31:42.4952318Z self =
T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True

    [test body identical to the T = 128 example above]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
[same ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row stack as above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:31:42.5401587Z W0507 20:31:42.538000 86695 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
    function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
    last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
    To log all recompilation reasons, use TORCH_LOGS="recompiles".
    To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
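[The recompile-limit warning is a side effect of the property-based sweep itself: every new value of T, and every contiguous/strided layout change (the "stride mismatch" reason above), makes torch._dynamo recompile silu_mul_quant until the default limit of 8 is hit. Two ways this could be quieted, sketched with the torch._dynamo knobs the warning itself points at rather than anything taken from the test file:

    import torch

    # Option 1: raise the limit for a sweep that legitimately sees many shapes.
    torch._dynamo.config.recompile_limit = 32

    # Option 2: mark dim 0 (T) of the inputs as dynamic before compiling, so
    # size changes alone no longer force a recompile (x0/x1 are the test's
    # local input tensors).
    # torch._dynamo.mark_dynamic(x0, 0)
    # torch._dynamo.mark_dynamic(x1, 0)
]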
2025-05-07T20:31:42.6428224Z self =
T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True

    [test body identical to the T = 128 example above]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
[same ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row stack as above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:31:42.9907317Z self =
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    [test body identical to the T = 128 example above]

[unlike the previous examples, this one fails inside fn() itself: the compiled silu_mul_quant launches the _fbgemm_silu_mul_quant kernel directly]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

[make_ir locals omitted; same shape as the frame shown above, with num_stages=3]

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
2025-05-07T20:31:43.0623721Z self =
T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True

    [test body identical to the T = 128 example above]

>       y_fp8_ref,
y_scale_ref = ref_fn() 2025-05-07T20:31:43.0640688Z 2025-05-07T20:31:43.0640794Z moe/activation_test.py:126: 2025-05-07T20:31:43.0641110Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.0641460Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:43.0641798Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:43.0642609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:43.0643382Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:43.0643941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.0644645Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.0645444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:43.0646190Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:43.0646961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:43.0647729Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:43.0648477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:43.0649137Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:43.0649747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:43.0650283Z fn() 2025-05-07T20:31:43.0650816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:43.0651417Z self.fn.run( 2025-05-07T20:31:43.0651894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.0652448Z kernel = self.compile( 2025-05-07T20:31:43.0653007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.0653674Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.0654081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.0654310Z 2025-05-07T20:31:43.0654553Z self = 2025-05-07T20:31:43.0656125Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.0657568Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fc51510fac0>} 2025-05-07T20:31:43.0659175Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.0660240Z context = 2025-05-07T20:31:43.0660536Z 2025-05-07T20:31:43.0660713Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.0661241Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.0661724Z module_map=module_map) 2025-05-07T20:31:43.0662100Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.0662471Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:43.0662749Z E ^ 2025-05-07T20:31:43.0663225Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.0663688Z 2025-05-07T20:31:43.0664117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.0664637Z 2025-05-07T20:31:43.0664751Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.0665170Z self=, 2025-05-07T20:31:43.0665583Z T=1, 2025-05-07T20:31:43.0665779Z D=5120, 2025-05-07T20:31:43.0665976Z scale_ub=None, 2025-05-07T20:31:43.0666205Z contiguous=True, 2025-05-07T20:31:43.0666441Z compiled=False, 2025-05-07T20:31:43.0666653Z ) 2025-05-07T20:31:43.2319263Z self = 2025-05-07T20:31:43.2319977Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:43.2320599Z 2025-05-07T20:31:43.2320685Z @given( 2025-05-07T20:31:43.2320930Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.2321259Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.2321579Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.2321923Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.2322265Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.2322553Z ) 2025-05-07T20:31:43.2322913Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.2323364Z def test_silu_mul_quant( 2025-05-07T20:31:43.2323614Z self, 2025-05-07T20:31:43.2323818Z T: int, 2025-05-07T20:31:43.2324027Z D: int, 2025-05-07T20:31:43.2324250Z scale_ub: Optional[float], 2025-05-07T20:31:43.2324534Z contiguous: bool, 2025-05-07T20:31:43.2324781Z compiled: bool, 2025-05-07T20:31:43.2325009Z ) -> None: 2025-05-07T20:31:43.2325241Z torch.manual_seed(2025) 2025-05-07T20:31:43.2325495Z 2025-05-07T20:31:43.2325770Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.2326125Z 2025-05-07T20:31:43.2326328Z x_sign = torch.sign(x) 2025-05-07T20:31:43.2326628Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.2326939Z x = x_sign * x_clamp 2025-05-07T20:31:43.2327186Z x0 = x[:, :D] 2025-05-07T20:31:43.2327408Z x1 = x[:, D:] 2025-05-07T20:31:43.2327617Z 2025-05-07T20:31:43.2327811Z if contiguous: 2025-05-07T20:31:43.2328048Z x0 = x0.contiguous() 2025-05-07T20:31:43.2328305Z x1 = x1.contiguous() 2025-05-07T20:31:43.2328558Z 2025-05-07T20:31:43.2328758Z if scale_ub is not None: 2025-05-07T20:31:43.2329034Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.2329380Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.2329701Z ) 2025-05-07T20:31:43.2329895Z else: 2025-05-07T20:31:43.2330114Z scale_ub_tensor = None 2025-05-07T20:31:43.2330376Z 2025-05-07T20:31:43.2330773Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.2331094Z op = silu_mul_quant 2025-05-07T20:31:43.2331353Z if compiled: 2025-05-07T20:31:43.2331603Z 
op = torch.compile(op) 2025-05-07T20:31:43.2331918Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.2332203Z 2025-05-07T20:31:43.2332408Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.2332576Z 2025-05-07T20:31:43.2332680Z moe/activation_test.py:117: 2025-05-07T20:31:43.2332985Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.2333321Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.2333610Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.2334326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.2335084Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.2335641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.2336332Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.2337011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.2337554Z kernel = self.compile( 2025-05-07T20:31:43.2338189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.2338863Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.2339267Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.2339497Z 2025-05-07T20:31:43.2339799Z self = 2025-05-07T20:31:43.2340896Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.2342311Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50fdb3760>} 2025-05-07T20:31:43.2343683Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.2344766Z context = 2025-05-07T20:31:43.2345078Z 2025-05-07T20:31:43.2345262Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.2345794Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.2346273Z module_map=module_map) 2025-05-07T20:31:43.2346654Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.2347015Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.2347290Z E ^ 2025-05-07T20:31:43.2347773Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.2348231Z 2025-05-07T20:31:43.2348662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.2349180Z 2025-05-07T20:31:43.2349287Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.2349717Z self=, 2025-05-07T20:31:43.2350123Z T=128, 2025-05-07T20:31:43.2350315Z D=5120, 2025-05-07T20:31:43.2350521Z scale_ub=None, 2025-05-07T20:31:43.2350743Z contiguous=False, 2025-05-07T20:31:43.2350972Z compiled=True, 2025-05-07T20:31:43.2351185Z ) 2025-05-07T20:31:43.2351520Z self = 2025-05-07T20:31:43.2352152Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:43.2352424Z 2025-05-07T20:31:43.2352503Z @given( 2025-05-07T20:31:43.2352743Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.2353064Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.2353373Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.2353715Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.2354063Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.2354351Z ) 2025-05-07T20:31:43.2354714Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.2355165Z def test_silu_mul_quant( 2025-05-07T20:31:43.2355419Z self, 2025-05-07T20:31:43.2356044Z T: int, 2025-05-07T20:31:43.2356251Z D: int, 2025-05-07T20:31:43.2356492Z scale_ub: Optional[float], 2025-05-07T20:31:43.2356780Z contiguous: bool, 2025-05-07T20:31:43.2357032Z compiled: bool, 2025-05-07T20:31:43.2357265Z ) -> None: 2025-05-07T20:31:43.2357485Z torch.manual_seed(2025) 2025-05-07T20:31:43.2366814Z 2025-05-07T20:31:43.2367144Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.2367499Z 2025-05-07T20:31:43.2367705Z x_sign = torch.sign(x) 2025-05-07T20:31:43.2368000Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.2368321Z x = x_sign * x_clamp 2025-05-07T20:31:43.2368571Z x0 = x[:, :D] 2025-05-07T20:31:43.2368785Z x1 = x[:, D:] 2025-05-07T20:31:43.2368998Z 2025-05-07T20:31:43.2369192Z if contiguous: 2025-05-07T20:31:43.2369426Z x0 = x0.contiguous() 2025-05-07T20:31:43.2369879Z x1 = x1.contiguous() 2025-05-07T20:31:43.2370135Z 2025-05-07T20:31:43.2370325Z if scale_ub is not None: 2025-05-07T20:31:43.2370613Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.2370960Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.2371280Z ) 2025-05-07T20:31:43.2371469Z else: 2025-05-07T20:31:43.2371682Z scale_ub_tensor = None 2025-05-07T20:31:43.2371939Z 2025-05-07T20:31:43.2372170Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.2372490Z op = silu_mul_quant 2025-05-07T20:31:43.2372750Z if compiled: 2025-05-07T20:31:43.2372998Z op = torch.compile(op) 2025-05-07T20:31:43.2373306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.2373585Z 2025-05-07T20:31:43.2373778Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.2373957Z 2025-05-07T20:31:43.2374059Z moe/activation_test.py:117: 2025-05-07T20:31:43.2374370Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.2374758Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.2375049Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.2375628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.2376205Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.2376869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.2377576Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.2378220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.2378922Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.2379598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.2380142Z kernel = self.compile( 2025-05-07T20:31:43.2380701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.2381507Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.2381916Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.2382152Z 2025-05-07T20:31:43.2382365Z self = 2025-05-07T20:31:43.2383470Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.2384870Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc51444ee60>} 2025-05-07T20:31:43.2386243Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.2387297Z context = 2025-05-07T20:31:43.2387591Z 2025-05-07T20:31:43.2387770Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.2388310Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.2388785Z module_map=module_map) 2025-05-07T20:31:43.2389155Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.2389511Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.2389772Z E ^ 2025-05-07T20:31:43.2390325Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.2390785Z 2025-05-07T20:31:43.2391208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.2391741Z 2025-05-07T20:31:43.2391848Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.2392271Z self=, 2025-05-07T20:31:43.2392679Z T=128, 2025-05-07T20:31:43.2392865Z D=7168, 2025-05-07T20:31:43.2393062Z scale_ub=1200.0, 2025-05-07T20:31:43.2393292Z contiguous=False, 2025-05-07T20:31:43.2393517Z compiled=False, 2025-05-07T20:31:43.2393727Z ) 2025-05-07T20:31:43.3657680Z self = 2025-05-07T20:31:43.3658498Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:43.3658835Z 2025-05-07T20:31:43.3658926Z @given( 2025-05-07T20:31:43.3659188Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.3659516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.3659833Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.3660179Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.3660522Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.3660818Z ) 2025-05-07T20:31:43.3661176Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.3661633Z def test_silu_mul_quant( 2025-05-07T20:31:43.3661884Z self, 2025-05-07T20:31:43.3662082Z T: int, 2025-05-07T20:31:43.3662290Z D: int, 2025-05-07T20:31:43.3662519Z scale_ub: Optional[float], 2025-05-07T20:31:43.3662805Z contiguous: bool, 2025-05-07T20:31:43.3663052Z compiled: bool, 2025-05-07T20:31:43.3663287Z ) -> None: 2025-05-07T20:31:43.3663513Z torch.manual_seed(2025) 2025-05-07T20:31:43.3663764Z 2025-05-07T20:31:43.3664062Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.3664424Z 2025-05-07T20:31:43.3664647Z x_sign = torch.sign(x) 2025-05-07T20:31:43.3665271Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.3665590Z x = x_sign * x_clamp 2025-05-07T20:31:43.3665843Z x0 = x[:, :D] 2025-05-07T20:31:43.3666074Z x1 = x[:, D:] 2025-05-07T20:31:43.3666291Z 2025-05-07T20:31:43.3666485Z if contiguous: 2025-05-07T20:31:43.3666730Z x0 = x0.contiguous() 2025-05-07T20:31:43.3667003Z x1 = x1.contiguous() 2025-05-07T20:31:43.3667251Z 2025-05-07T20:31:43.3667454Z if scale_ub is not None: 2025-05-07T20:31:43.3667744Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.3668088Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.3668411Z ) 2025-05-07T20:31:43.3668615Z else: 2025-05-07T20:31:43.3668839Z scale_ub_tensor = None 2025-05-07T20:31:43.3669102Z 2025-05-07T20:31:43.3669354Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.3669682Z op = silu_mul_quant 2025-05-07T20:31:43.3669946Z if compiled: 2025-05-07T20:31:43.3670207Z op = torch.compile(op) 2025-05-07T20:31:43.3670521Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.3670798Z 2025-05-07T20:31:43.3671003Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.3671178Z 2025-05-07T20:31:43.3671292Z moe/activation_test.py:117: 2025-05-07T20:31:43.3671594Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.3671933Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.3672229Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.3672942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.3673647Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.3674347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.3675112Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.3675791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.3676340Z kernel = self.compile( 2025-05-07T20:31:43.3676903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.3677580Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.3677986Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.3678222Z 2025-05-07T20:31:43.3678438Z self = 2025-05-07T20:31:43.3679550Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.3680972Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc51444dab0>} 2025-05-07T20:31:43.3682339Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.3683390Z context = 2025-05-07T20:31:43.3683693Z 2025-05-07T20:31:43.3683867Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.3684408Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.3684895Z module_map=module_map) 2025-05-07T20:31:43.3685276Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.3685649Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.3686079Z E ^ 2025-05-07T20:31:43.3686555Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.3687022Z 2025-05-07T20:31:43.3687449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.3687975Z 2025-05-07T20:31:43.3688093Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.3688527Z self=, 2025-05-07T20:31:43.3688940Z T=128, 2025-05-07T20:31:43.3689144Z D=5120, 2025-05-07T20:31:43.3689350Z scale_ub=None, 2025-05-07T20:31:43.3689575Z contiguous=False, 2025-05-07T20:31:43.3689817Z compiled=False, 2025-05-07T20:31:43.3690041Z ) 2025-05-07T20:31:43.3690376Z self = 2025-05-07T20:31:43.3690890Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:43.3691173Z 2025-05-07T20:31:43.3691261Z @given( 2025-05-07T20:31:43.3691498Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.3691828Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.3692152Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.3692496Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.3692837Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.3693139Z ) 2025-05-07T20:31:43.3693508Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.3693958Z def test_silu_mul_quant( 2025-05-07T20:31:43.3694215Z self, 2025-05-07T20:31:43.3694427Z T: int, 2025-05-07T20:31:43.3694656Z D: int, 2025-05-07T20:31:43.3694996Z scale_ub: Optional[float], 2025-05-07T20:31:43.3695284Z contiguous: bool, 2025-05-07T20:31:43.3695531Z compiled: bool, 2025-05-07T20:31:43.3695776Z ) -> None: 2025-05-07T20:31:43.3696005Z torch.manual_seed(2025) 2025-05-07T20:31:43.3696253Z 2025-05-07T20:31:43.3696539Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.3696893Z 2025-05-07T20:31:43.3697096Z x_sign = torch.sign(x) 2025-05-07T20:31:43.3697408Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.3697734Z x = x_sign * x_clamp 2025-05-07T20:31:43.3697989Z x0 = x[:, :D] 2025-05-07T20:31:43.3698311Z x1 = x[:, D:] 2025-05-07T20:31:43.3698532Z 2025-05-07T20:31:43.3698732Z if contiguous: 2025-05-07T20:31:43.3698985Z x0 = x0.contiguous() 2025-05-07T20:31:43.3699260Z x1 = x1.contiguous() 2025-05-07T20:31:43.3699507Z 2025-05-07T20:31:43.3699726Z if scale_ub is not None: 2025-05-07T20:31:43.3700017Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.3700373Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.3700696Z ) 2025-05-07T20:31:43.3700903Z else: 2025-05-07T20:31:43.3701127Z scale_ub_tensor = None 2025-05-07T20:31:43.3701391Z 2025-05-07T20:31:43.3701640Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.3701969Z op = silu_mul_quant 2025-05-07T20:31:43.3702230Z if compiled: 2025-05-07T20:31:43.3702502Z op = torch.compile(op) 2025-05-07T20:31:43.3702814Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.3703095Z 2025-05-07T20:31:43.3703307Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.3703480Z 2025-05-07T20:31:43.3703590Z moe/activation_test.py:117: 2025-05-07T20:31:43.3703891Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.3704233Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.3704531Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.3705298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.3706126Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.3706683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.3707385Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.3708061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.3708610Z kernel = self.compile( 2025-05-07T20:31:43.3709170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.3709853Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.3710255Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.3710494Z 2025-05-07T20:31:43.3710715Z self = 2025-05-07T20:31:43.3711824Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.3713236Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc514dd2950>} 2025-05-07T20:31:43.3714611Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.3715737Z context = 2025-05-07T20:31:43.3716043Z 2025-05-07T20:31:43.3716221Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.3716767Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.3717247Z module_map=module_map) 2025-05-07T20:31:43.3717626Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.3717994Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.3718267Z E ^ 2025-05-07T20:31:43.3718743Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.3719210Z 2025-05-07T20:31:43.3719637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.3720159Z 2025-05-07T20:31:43.3720279Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.3720720Z self=, 2025-05-07T20:31:43.3721131Z T=128, 2025-05-07T20:31:43.3721331Z D=5120, 2025-05-07T20:31:43.3721544Z scale_ub=1200.0, 2025-05-07T20:31:43.3721778Z contiguous=True, 2025-05-07T20:31:43.3722014Z compiled=False, 2025-05-07T20:31:43.3722233Z ) 2025-05-07T20:31:43.5659586Z self = 2025-05-07T20:31:43.5660213Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:43.5660611Z 2025-05-07T20:31:43.5660698Z @given( 2025-05-07T20:31:43.5660935Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.5661253Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.5661572Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.5661917Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.5662360Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.5662679Z ) 2025-05-07T20:31:43.5663042Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.5663843Z def test_silu_mul_quant( 2025-05-07T20:31:43.5664094Z self, 2025-05-07T20:31:43.5664296Z T: int, 2025-05-07T20:31:43.5664500Z D: int, 2025-05-07T20:31:43.5664721Z scale_ub: Optional[float], 2025-05-07T20:31:43.5664999Z contiguous: bool, 2025-05-07T20:31:43.5665245Z compiled: bool, 2025-05-07T20:31:43.5665517Z ) -> None: 2025-05-07T20:31:43.5665738Z torch.manual_seed(2025) 2025-05-07T20:31:43.5665986Z 2025-05-07T20:31:43.5666270Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.5666622Z 2025-05-07T20:31:43.5666816Z x_sign = torch.sign(x) 2025-05-07T20:31:43.5667118Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.5667437Z x = x_sign * x_clamp 2025-05-07T20:31:43.5667686Z x0 = x[:, :D] 2025-05-07T20:31:43.5667908Z x1 = x[:, D:] 2025-05-07T20:31:43.5668123Z 2025-05-07T20:31:43.5668309Z if contiguous: 2025-05-07T20:31:43.5668558Z x0 = x0.contiguous() 2025-05-07T20:31:43.5668823Z x1 = x1.contiguous() 2025-05-07T20:31:43.5669064Z 2025-05-07T20:31:43.5669263Z if scale_ub is not None: 2025-05-07T20:31:43.5669546Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.5669887Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.5670202Z ) 2025-05-07T20:31:43.5670403Z else: 2025-05-07T20:31:43.5670618Z scale_ub_tensor = None 2025-05-07T20:31:43.5670878Z 2025-05-07T20:31:43.5671120Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.5671447Z op = silu_mul_quant 2025-05-07T20:31:43.5671707Z if compiled: 2025-05-07T20:31:43.5671974Z op = torch.compile(op) 2025-05-07T20:31:43.5672434Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.5672717Z 2025-05-07T20:31:43.5672923Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.5673096Z 2025-05-07T20:31:43.5673212Z moe/activation_test.py:117: 2025-05-07T20:31:43.5673511Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.5673852Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.5674153Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.5674913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.5675623Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.5676177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.5676878Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.5677561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.5678109Z kernel = self.compile( 2025-05-07T20:31:43.5678678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.5679359Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.5679761Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.5680000Z 2025-05-07T20:31:43.5680213Z self = 2025-05-07T20:31:43.5681317Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.5682739Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc5144cc3a0>} 2025-05-07T20:31:43.5684107Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.5685297Z context = 2025-05-07T20:31:43.5685601Z 2025-05-07T20:31:43.5685773Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.5686312Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.5686789Z module_map=module_map) 2025-05-07T20:31:43.5687164Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.5687530Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.5687792Z E ^ 2025-05-07T20:31:43.5688275Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.5688741Z 2025-05-07T20:31:43.5689167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.5689693Z 2025-05-07T20:31:43.5689810Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.5690236Z self=, 2025-05-07T20:31:43.5690647Z T=1, 2025-05-07T20:31:43.5690841Z D=7168, 2025-05-07T20:31:43.5691038Z scale_ub=1200.0, 2025-05-07T20:31:43.5691277Z contiguous=True, 2025-05-07T20:31:43.5691509Z compiled=True, 2025-05-07T20:31:43.5691720Z ) 2025-05-07T20:31:43.5692050Z self = 2025-05-07T20:31:43.5692547Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:43.5692812Z 2025-05-07T20:31:43.5692900Z @given( 2025-05-07T20:31:43.5693220Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.5693545Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.5693866Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.5694207Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.5694550Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.5694843Z ) 2025-05-07T20:31:43.5695201Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.5695659Z def test_silu_mul_quant( 2025-05-07T20:31:43.5695914Z self, 2025-05-07T20:31:43.5696122Z T: int, 2025-05-07T20:31:43.5696331Z D: int, 2025-05-07T20:31:43.5696565Z scale_ub: Optional[float], 2025-05-07T20:31:43.5696853Z contiguous: bool, 2025-05-07T20:31:43.5697102Z compiled: bool, 2025-05-07T20:31:43.5697343Z ) -> None: 2025-05-07T20:31:43.5697573Z torch.manual_seed(2025) 2025-05-07T20:31:43.5697820Z 2025-05-07T20:31:43.5698252Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.5698610Z 2025-05-07T20:31:43.5698813Z x_sign = torch.sign(x) 2025-05-07T20:31:43.5699127Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.5699451Z x = x_sign * x_clamp 2025-05-07T20:31:43.5699703Z x0 = x[:, :D] 2025-05-07T20:31:43.5699937Z x1 = x[:, D:] 2025-05-07T20:31:43.5700158Z 2025-05-07T20:31:43.5700352Z if contiguous: 2025-05-07T20:31:43.5700598Z x0 = x0.contiguous() 2025-05-07T20:31:43.5700868Z x1 = x1.contiguous() 2025-05-07T20:31:43.5701121Z 2025-05-07T20:31:43.5701321Z if scale_ub is not None: 2025-05-07T20:31:43.5701612Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.5701959Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.5702272Z ) 2025-05-07T20:31:43.5702477Z else: 2025-05-07T20:31:43.5702700Z scale_ub_tensor = None 2025-05-07T20:31:43.5702968Z 2025-05-07T20:31:43.5703213Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.5703541Z op = silu_mul_quant 2025-05-07T20:31:43.5703933Z if compiled: 2025-05-07T20:31:43.5704195Z op = torch.compile(op) 2025-05-07T20:31:43.5704507Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.5704827Z 2025-05-07T20:31:43.5705046Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.5705216Z 2025-05-07T20:31:43.5705326Z moe/activation_test.py:117: 2025-05-07T20:31:43.5705632Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.5705964Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.5706262Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.5706837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.5707407Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.5708087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.5708800Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.5709356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.5710045Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.5710736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.5711292Z kernel = self.compile( 2025-05-07T20:31:43.5711847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.5712524Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.5712931Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.5713267Z 2025-05-07T20:31:43.5713492Z self = 2025-05-07T20:31:43.5714592Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.5716042Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc5144cedd0>} 2025-05-07T20:31:43.5728081Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.5729139Z context = 2025-05-07T20:31:43.5729436Z 2025-05-07T20:31:43.5729626Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.5730157Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.5730644Z module_map=module_map) 2025-05-07T20:31:43.5731023Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.5731391Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.5731653Z E ^ 2025-05-07T20:31:43.5732135Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.5732597Z 2025-05-07T20:31:43.5733029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.5733550Z 2025-05-07T20:31:43.5733658Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.5734082Z self=, 2025-05-07T20:31:43.5734501Z T=1, 2025-05-07T20:31:43.5734727Z D=7168, 2025-05-07T20:31:43.5734949Z scale_ub=1200.0, 2025-05-07T20:31:43.5735184Z contiguous=False, 2025-05-07T20:31:43.5735420Z compiled=True, 2025-05-07T20:31:43.5735746Z ) 2025-05-07T20:31:43.9106201Z self = 2025-05-07T20:31:43.9106780Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:43.9107054Z 2025-05-07T20:31:43.9107145Z @given( 2025-05-07T20:31:43.9107383Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.9107707Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.9108032Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.9108371Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.9108714Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.9109012Z ) 2025-05-07T20:31:43.9109370Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.9109848Z def test_silu_mul_quant( 2025-05-07T20:31:43.9110103Z self, 2025-05-07T20:31:43.9110310Z T: int, 2025-05-07T20:31:43.9110519Z D: int, 2025-05-07T20:31:43.9110749Z scale_ub: Optional[float], 2025-05-07T20:31:43.9111036Z contiguous: bool, 2025-05-07T20:31:43.9111280Z compiled: bool, 2025-05-07T20:31:43.9111522Z ) -> None: 2025-05-07T20:31:43.9111752Z torch.manual_seed(2025) 2025-05-07T20:31:43.9111998Z 2025-05-07T20:31:43.9112286Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.9112643Z 2025-05-07T20:31:43.9112849Z x_sign = torch.sign(x) 2025-05-07T20:31:43.9113146Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.9113465Z x = x_sign * x_clamp 2025-05-07T20:31:43.9113713Z x0 = x[:, :D] 2025-05-07T20:31:43.9113931Z x1 = x[:, D:] 2025-05-07T20:31:43.9114145Z 2025-05-07T20:31:43.9114844Z if contiguous: 2025-05-07T20:31:43.9115093Z x0 = x0.contiguous() 2025-05-07T20:31:43.9115361Z x1 = x1.contiguous() 2025-05-07T20:31:43.9115611Z 2025-05-07T20:31:43.9115812Z if scale_ub is not None: 2025-05-07T20:31:43.9116100Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.9116451Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.9116761Z ) 2025-05-07T20:31:43.9116961Z else: 2025-05-07T20:31:43.9117180Z scale_ub_tensor = None 2025-05-07T20:31:43.9117443Z 2025-05-07T20:31:43.9117681Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.9118003Z op = silu_mul_quant 2025-05-07T20:31:43.9118265Z if compiled: 2025-05-07T20:31:43.9118523Z op = torch.compile(op) 2025-05-07T20:31:43.9118840Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.9119131Z 2025-05-07T20:31:43.9119328Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.9119511Z 2025-05-07T20:31:43.9119616Z moe/activation_test.py:117: 2025-05-07T20:31:43.9119927Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.9120262Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.9120558Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.9121136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.9121716Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.9122384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.9123089Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.9123644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.9124335Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.9125071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.9125613Z kernel = self.compile( 2025-05-07T20:31:43.9126327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.9126994Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.9127399Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.9127627Z 2025-05-07T20:31:43.9127845Z self = 2025-05-07T20:31:43.9128946Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.9130355Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc5144cc0d0>} 2025-05-07T20:31:43.9131724Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.9132775Z context = 2025-05-07T20:31:43.9133068Z 2025-05-07T20:31:43.9133245Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.9133775Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.9134258Z module_map=module_map) 2025-05-07T20:31:43.9134645Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.9135048Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.9135326Z E ^ 2025-05-07T20:31:43.9135915Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.9136375Z 2025-05-07T20:31:43.9136813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.9137332Z 2025-05-07T20:31:43.9137447Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.9137867Z self=, 2025-05-07T20:31:43.9138360Z T=1, 2025-05-07T20:31:43.9138552Z D=7168, 2025-05-07T20:31:43.9138745Z scale_ub=None, 2025-05-07T20:31:43.9138973Z contiguous=False, 2025-05-07T20:31:43.9139211Z compiled=True, 2025-05-07T20:31:43.9139419Z ) 2025-05-07T20:31:44.0098517Z self = 2025-05-07T20:31:44.0099057Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:44.0099325Z 2025-05-07T20:31:44.0099412Z @given( 2025-05-07T20:31:44.0099667Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.0099991Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.0100319Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.0100654Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.0101000Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.0101294Z ) 2025-05-07T20:31:44.0101648Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.0102103Z def test_silu_mul_quant( 2025-05-07T20:31:44.0102355Z self, 2025-05-07T20:31:44.0102551Z T: int, 2025-05-07T20:31:44.0102756Z D: int, 2025-05-07T20:31:44.0102988Z scale_ub: Optional[float], 2025-05-07T20:31:44.0103291Z contiguous: bool, 2025-05-07T20:31:44.0103536Z compiled: bool, 2025-05-07T20:31:44.0103781Z ) -> None: 2025-05-07T20:31:44.0104007Z torch.manual_seed(2025) 2025-05-07T20:31:44.0104257Z 2025-05-07T20:31:44.0104541Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.0104926Z 2025-05-07T20:31:44.0105485Z x_sign = torch.sign(x) 2025-05-07T20:31:44.0105778Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.0106094Z x = x_sign * x_clamp 2025-05-07T20:31:44.0106339Z x0 = x[:, :D] 2025-05-07T20:31:44.0106555Z x1 = x[:, D:] 2025-05-07T20:31:44.0106769Z 2025-05-07T20:31:44.0106962Z if contiguous: 2025-05-07T20:31:44.0107196Z x0 = x0.contiguous() 2025-05-07T20:31:44.0107460Z x1 = x1.contiguous() 2025-05-07T20:31:44.0107705Z 2025-05-07T20:31:44.0107900Z if scale_ub is not None: 2025-05-07T20:31:44.0108185Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.0108532Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.0108842Z ) 2025-05-07T20:31:44.0109042Z else: 2025-05-07T20:31:44.0109265Z scale_ub_tensor = None 2025-05-07T20:31:44.0109520Z 2025-05-07T20:31:44.0109758Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.0110084Z op = silu_mul_quant 2025-05-07T20:31:44.0110339Z if compiled: 2025-05-07T20:31:44.0110589Z op = torch.compile(op) 2025-05-07T20:31:44.0110898Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.0111178Z 2025-05-07T20:31:44.0111376Z y_fp8, y_scale = fn() 2025-05-07T20:31:44.0111670Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:44.0111966Z 2025-05-07T20:31:44.0112208Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.0112551Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:44.0112852Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:44.0113170Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:44.0113683Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:44.0114008Z 2025-05-07T20:31:44.0114224Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:44.0114428Z 2025-05-07T20:31:44.0114536Z moe/activation_test.py:126: 2025-05-07T20:31:44.0114877Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.0115232Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:44.0115566Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:44.0116371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:44.0117138Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:44.0117697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.0118388Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.0119095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:44.0119837Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:44.0120604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:44.0121365Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:44.0122113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:44.0122765Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:44.0123372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:44.0123916Z fn() 2025-05-07T20:31:44.0124446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:44.0125077Z self.fn.run( 2025-05-07T20:31:44.0125574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.0126242Z kernel = self.compile( 2025-05-07T20:31:44.0126801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.0127474Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.0127872Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.0128109Z 2025-05-07T20:31:44.0128321Z self = 2025-05-07T20:31:44.0129433Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.0130851Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fc5144ce440>} 2025-05-07T20:31:44.0132219Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.0133274Z context = 2025-05-07T20:31:44.0133576Z 2025-05-07T20:31:44.0133748Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.0134287Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.0134787Z module_map=module_map) 2025-05-07T20:31:44.0135192Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.0135643Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:44.0135925Z E ^ 2025-05-07T20:31:44.0136400Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.0136868Z 2025-05-07T20:31:44.0137293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.0137811Z 2025-05-07T20:31:44.0137926Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.0138450Z self=, 2025-05-07T20:31:44.0138855Z T=1, 2025-05-07T20:31:44.0139048Z D=5120, 2025-05-07T20:31:44.0139251Z scale_ub=1200.0, 2025-05-07T20:31:44.0139481Z contiguous=False, 2025-05-07T20:31:44.0139720Z compiled=True, 2025-05-07T20:31:44.0139939Z ) 2025-05-07T20:31:44.1823639Z self = 2025-05-07T20:31:44.1824219Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:44.1824498Z 2025-05-07T20:31:44.1824579Z @given( 2025-05-07T20:31:44.1824824Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.1825151Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.1825470Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.1825817Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.1826157Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.1826443Z ) 2025-05-07T20:31:44.1826802Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.1827254Z def test_silu_mul_quant( 2025-05-07T20:31:44.1827497Z self, 2025-05-07T20:31:44.1827700Z T: int, 2025-05-07T20:31:44.1827904Z D: int, 2025-05-07T20:31:44.1828130Z scale_ub: Optional[float], 2025-05-07T20:31:44.1828417Z contiguous: bool, 2025-05-07T20:31:44.1828666Z compiled: bool, 2025-05-07T20:31:44.1828901Z ) -> None: 2025-05-07T20:31:44.1829130Z torch.manual_seed(2025) 2025-05-07T20:31:44.1829384Z 2025-05-07T20:31:44.1829992Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.1830354Z 2025-05-07T20:31:44.1830563Z x_sign = torch.sign(x) 2025-05-07T20:31:44.1830871Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.1831184Z x = x_sign * x_clamp 2025-05-07T20:31:44.1831434Z x0 = x[:, :D] 2025-05-07T20:31:44.1831658Z x1 = x[:, D:] 2025-05-07T20:31:44.1831873Z 2025-05-07T20:31:44.1832070Z if contiguous: 2025-05-07T20:31:44.1832310Z x0 = x0.contiguous() 2025-05-07T20:31:44.1832572Z x1 = x1.contiguous() 2025-05-07T20:31:44.1832827Z 2025-05-07T20:31:44.1833031Z if scale_ub is not None: 2025-05-07T20:31:44.1833310Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.1833661Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.1833984Z ) 2025-05-07T20:31:44.1834185Z else: 2025-05-07T20:31:44.1834405Z scale_ub_tensor = None 2025-05-07T20:31:44.1834685Z 2025-05-07T20:31:44.1834961Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.1835283Z op = silu_mul_quant 2025-05-07T20:31:44.1835539Z if compiled: 
2025-05-07T20:31:44.1835788Z op = torch.compile(op) 2025-05-07T20:31:44.1836099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.1836378Z 2025-05-07T20:31:44.1836579Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.1836751Z 2025-05-07T20:31:44.1836857Z moe/activation_test.py:117: 2025-05-07T20:31:44.1837158Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.1837492Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.1837779Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.1838506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:44.1839084Z return fn(*args, **kwargs) 2025-05-07T20:31:44.1839764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.1840461Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.1841010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.1841709Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.1842379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.1842921Z kernel = self.compile( 2025-05-07T20:31:44.1843477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.1844155Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.1844552Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.1844792Z 2025-05-07T20:31:44.1845031Z self = 2025-05-07T20:31:44.1846153Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.1847566Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50feca5f0>} 2025-05-07T20:31:44.1848933Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.1849983Z context = 2025-05-07T20:31:44.1850286Z 2025-05-07T20:31:44.1850457Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.1851078Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.1851554Z module_map=module_map) 2025-05-07T20:31:44.1851930Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.1852294Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.1852567Z E ^ 2025-05-07T20:31:44.1853038Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:44.1853506Z
2025-05-07T20:31:44.1853931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:44.1854452Z
2025-05-07T20:31:44.1854574Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:44.1886772Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:44.2932558Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:44.4250611Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:44.4285223Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:44.4316798Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:31:44.6310375Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:44.9980593Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:45.0014983Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:45.1329313Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:45.1361600Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
[Each of the eleven examples above failed identically: the same CompilationError raised from _fbgemm_silu_mul_quant — ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at triton/compiler/compiler.py:100 — with test source and tracebacks verbatim duplicates of the one shown above; the repeats are omitted here.]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.1398823Z 2025-05-07T20:31:45.1399245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.1399764Z 2025-05-07T20:31:45.1399875Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.1400290Z self=, 2025-05-07T20:31:45.1400703Z T=16384, 2025-05-07T20:31:45.1400903Z D=7168, 2025-05-07T20:31:45.1401106Z scale_ub=1200.0, 2025-05-07T20:31:45.1401335Z contiguous=False, 2025-05-07T20:31:45.1401577Z compiled=True, 2025-05-07T20:31:45.4041047Z ) 2025-05-07T20:31:45.4042023Z self = 2025-05-07T20:31:45.4043265Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:45.4043673Z 2025-05-07T20:31:45.4043791Z @given( 2025-05-07T20:31:45.4044115Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4044454Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4044773Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4045163Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4045518Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4045817Z ) 2025-05-07T20:31:45.4046181Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4046646Z def test_silu_mul_quant( 2025-05-07T20:31:45.4046903Z self, 2025-05-07T20:31:45.4047119Z T: int, 2025-05-07T20:31:45.4047334Z D: int, 2025-05-07T20:31:45.4047595Z scale_ub: Optional[float], 2025-05-07T20:31:45.4047887Z contiguous: bool, 2025-05-07T20:31:45.4048148Z compiled: bool, 2025-05-07T20:31:45.4048382Z ) -> None: 2025-05-07T20:31:45.4048615Z torch.manual_seed(2025) 2025-05-07T20:31:45.4048867Z 2025-05-07T20:31:45.4049151Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4049508Z 2025-05-07T20:31:45.4049720Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4050023Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4050363Z x = x_sign * x_clamp 2025-05-07T20:31:45.4050627Z x0 = x[:, :D] 2025-05-07T20:31:45.4050853Z x1 = x[:, D:] 2025-05-07T20:31:45.4051082Z 2025-05-07T20:31:45.4051285Z if contiguous: 2025-05-07T20:31:45.4051522Z x0 = x0.contiguous() 2025-05-07T20:31:45.4051978Z x1 = x1.contiguous() 2025-05-07T20:31:45.4052234Z 2025-05-07T20:31:45.4052433Z if scale_ub is not None: 2025-05-07T20:31:45.4052720Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4053078Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4053409Z ) 2025-05-07T20:31:45.4053611Z else: 2025-05-07T20:31:45.4053844Z scale_ub_tensor = None 2025-05-07T20:31:45.4054111Z 2025-05-07T20:31:45.4054350Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4054680Z op = silu_mul_quant 2025-05-07T20:31:45.4054947Z if compiled: 2025-05-07T20:31:45.4055217Z op = torch.compile(op) 2025-05-07T20:31:45.4055880Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4056175Z 2025-05-07T20:31:45.4056375Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4056554Z 2025-05-07T20:31:45.4056660Z moe/activation_test.py:117: 2025-05-07T20:31:45.4056977Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4057309Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4057606Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4058313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4058894Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.4059565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4060274Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4060830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4061529Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4062200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4062756Z kernel = self.compile( 2025-05-07T20:31:45.4063315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4064157Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4064570Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4064805Z 2025-05-07T20:31:45.4065020Z self = 2025-05-07T20:31:45.4066131Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4067545Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50f2455a0>} 2025-05-07T20:31:45.4068923Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4069981Z context = 2025-05-07T20:31:45.4070273Z 2025-05-07T20:31:45.4070452Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4070989Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4071467Z module_map=module_map) 2025-05-07T20:31:45.4071848Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4072220Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4072482Z E ^ 2025-05-07T20:31:45.4072960Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4073535Z 2025-05-07T20:31:45.4073972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4074500Z 2025-05-07T20:31:45.4074614Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4075054Z self=, 2025-05-07T20:31:45.4075499Z T=1, 2025-05-07T20:31:45.4075696Z D=7168, 2025-05-07T20:31:45.4075891Z scale_ub=None, 2025-05-07T20:31:45.4076117Z contiguous=False, 2025-05-07T20:31:45.4076354Z compiled=False, 2025-05-07T20:31:45.4076565Z ) 2025-05-07T20:31:45.4076898Z self = 2025-05-07T20:31:45.4077398Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.4077664Z 2025-05-07T20:31:45.4077752Z @given( 2025-05-07T20:31:45.4077983Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4078314Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4078630Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4078966Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4079314Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4079610Z ) 2025-05-07T20:31:45.4079966Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4080424Z def test_silu_mul_quant( 2025-05-07T20:31:45.4080676Z self, 2025-05-07T20:31:45.4080872Z T: int, 2025-05-07T20:31:45.4081075Z D: int, 2025-05-07T20:31:45.4081299Z scale_ub: Optional[float], 2025-05-07T20:31:45.4081571Z contiguous: bool, 2025-05-07T20:31:45.4081820Z compiled: bool, 2025-05-07T20:31:45.4082067Z ) -> None: 2025-05-07T20:31:45.4082296Z torch.manual_seed(2025) 2025-05-07T20:31:45.4082550Z 2025-05-07T20:31:45.4082829Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4083189Z 2025-05-07T20:31:45.4083395Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4083691Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4084101Z x = x_sign * x_clamp 2025-05-07T20:31:45.4084350Z x0 = x[:, :D] 2025-05-07T20:31:45.4084573Z x1 = x[:, D:] 2025-05-07T20:31:45.4084793Z 2025-05-07T20:31:45.4084996Z if contiguous: 2025-05-07T20:31:45.4085231Z x0 = x0.contiguous() 2025-05-07T20:31:45.4085548Z x1 = x1.contiguous() 2025-05-07T20:31:45.4085799Z 2025-05-07T20:31:45.4085996Z if scale_ub is not None: 2025-05-07T20:31:45.4086284Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4086633Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4086954Z ) 2025-05-07T20:31:45.4087149Z else: 2025-05-07T20:31:45.4087370Z scale_ub_tensor = None 2025-05-07T20:31:45.4087633Z 2025-05-07T20:31:45.4087873Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4088198Z op = silu_mul_quant 2025-05-07T20:31:45.4088454Z if compiled: 2025-05-07T20:31:45.4088713Z op = torch.compile(op) 2025-05-07T20:31:45.4089019Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4089302Z 2025-05-07T20:31:45.4089501Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4089679Z 2025-05-07T20:31:45.4089784Z moe/activation_test.py:117: 2025-05-07T20:31:45.4090087Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4090426Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4090711Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4091416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4092121Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4092749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4093449Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4094132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4094682Z kernel = self.compile( 2025-05-07T20:31:45.4095231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4095902Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4096308Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4096535Z 2025-05-07T20:31:45.4096746Z self = 2025-05-07T20:31:45.4097851Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4099322Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50f245d80>} 2025-05-07T20:31:45.4100699Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4101745Z context = 2025-05-07T20:31:45.4102038Z 2025-05-07T20:31:45.4102210Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4102747Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4103230Z module_map=module_map) 2025-05-07T20:31:45.4103612Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4103973Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4104243Z E ^ 2025-05-07T20:31:45.4104809Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4105318Z 2025-05-07T20:31:45.4105743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4106270Z 2025-05-07T20:31:45.4106382Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4106810Z self=, 2025-05-07T20:31:45.4107222Z T=2048, 2025-05-07T20:31:45.4107415Z D=7168, 2025-05-07T20:31:45.4107619Z scale_ub=None, 2025-05-07T20:31:45.4107848Z contiguous=False, 2025-05-07T20:31:45.4108075Z compiled=True, 2025-05-07T20:31:45.4108285Z ) 2025-05-07T20:31:45.5115834Z self = 2025-05-07T20:31:45.5116636Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.5117045Z 2025-05-07T20:31:45.5117160Z @given( 2025-05-07T20:31:45.5117496Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5117842Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5118164Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5118511Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5118850Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5119148Z ) 2025-05-07T20:31:45.5119516Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5119965Z def test_silu_mul_quant( 2025-05-07T20:31:45.5120224Z self, 2025-05-07T20:31:45.5120436Z T: int, 2025-05-07T20:31:45.5120639Z D: int, 2025-05-07T20:31:45.5120876Z scale_ub: Optional[float], 2025-05-07T20:31:45.5121835Z contiguous: bool, 2025-05-07T20:31:45.5122093Z compiled: bool, 2025-05-07T20:31:45.5122334Z ) -> None: 2025-05-07T20:31:45.5122573Z torch.manual_seed(2025) 2025-05-07T20:31:45.5122828Z 2025-05-07T20:31:45.5123105Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5123461Z 2025-05-07T20:31:45.5123671Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5123972Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5124299Z x = x_sign * x_clamp 2025-05-07T20:31:45.5124555Z x0 = x[:, :D] 2025-05-07T20:31:45.5124777Z x1 = x[:, D:] 2025-05-07T20:31:45.5125002Z 2025-05-07T20:31:45.5125224Z if contiguous: 2025-05-07T20:31:45.5125486Z x0 = x0.contiguous() 2025-05-07T20:31:45.5125759Z x1 = x1.contiguous() 2025-05-07T20:31:45.5126014Z 2025-05-07T20:31:45.5126215Z if scale_ub is not None: 2025-05-07T20:31:45.5126511Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5126869Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5127187Z ) 2025-05-07T20:31:45.5127406Z else: 2025-05-07T20:31:45.5127644Z scale_ub_tensor = None 2025-05-07T20:31:45.5127909Z 2025-05-07T20:31:45.5128161Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5128494Z op = silu_mul_quant 2025-05-07T20:31:45.5128761Z if compiled: 2025-05-07T20:31:45.5129015Z op = torch.compile(op) 2025-05-07T20:31:45.5129327Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5129616Z 2025-05-07T20:31:45.5129818Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.5129997Z 2025-05-07T20:31:45.5130102Z moe/activation_test.py:117: 2025-05-07T20:31:45.5130410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5130744Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.5131047Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5131643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.5132392Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.5133067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.5133775Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.5134330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5135019Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5135753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5136305Z kernel = self.compile( 2025-05-07T20:31:45.5136872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5137541Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5137958Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5138272Z 2025-05-07T20:31:45.5138493Z self = 2025-05-07T20:31:45.5139600Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5141000Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50f246f80>} 2025-05-07T20:31:45.5142491Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5143547Z context = 2025-05-07T20:31:45.5143848Z 2025-05-07T20:31:45.5144029Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5144563Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5145049Z module_map=module_map) 2025-05-07T20:31:45.5145432Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5145806Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.5146072Z E ^ 2025-05-07T20:31:45.5146557Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5147017Z 2025-05-07T20:31:45.5147460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5147981Z 2025-05-07T20:31:45.5148101Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5148524Z self=, 2025-05-07T20:31:45.5148947Z T=4096, 2025-05-07T20:31:45.5149152Z D=7168, 2025-05-07T20:31:45.5149353Z scale_ub=None, 2025-05-07T20:31:45.5149585Z contiguous=False, 2025-05-07T20:31:45.5149826Z compiled=True, 2025-05-07T20:31:45.5150041Z ) 2025-05-07T20:31:45.5150376Z self = 2025-05-07T20:31:45.5150889Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.5151167Z 2025-05-07T20:31:45.5151246Z @given( 2025-05-07T20:31:45.5151490Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5151819Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5152142Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5152489Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5152833Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5153130Z ) 2025-05-07T20:31:45.5153574Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5154030Z def test_silu_mul_quant( 2025-05-07T20:31:45.5154284Z self, 2025-05-07T20:31:45.5154482Z T: int, 2025-05-07T20:31:45.5154690Z D: int, 2025-05-07T20:31:45.5154923Z scale_ub: Optional[float], 2025-05-07T20:31:45.5155230Z contiguous: bool, 2025-05-07T20:31:45.5155506Z compiled: bool, 2025-05-07T20:31:45.5156034Z ) -> None: 2025-05-07T20:31:45.5156254Z torch.manual_seed(2025) 2025-05-07T20:31:45.5156506Z 2025-05-07T20:31:45.5156794Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5157139Z 2025-05-07T20:31:45.5157349Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5157657Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5157979Z x = x_sign * x_clamp 2025-05-07T20:31:45.5158223Z x0 = x[:, :D] 2025-05-07T20:31:45.5158457Z x1 = x[:, D:] 2025-05-07T20:31:45.5158677Z 2025-05-07T20:31:45.5158869Z if contiguous: 2025-05-07T20:31:45.5159113Z x0 = x0.contiguous() 2025-05-07T20:31:45.5159384Z x1 = x1.contiguous() 2025-05-07T20:31:45.5159627Z 2025-05-07T20:31:45.5159830Z if scale_ub is not None: 2025-05-07T20:31:45.5160114Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5160460Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5160780Z ) 2025-05-07T20:31:45.5160985Z else: 2025-05-07T20:31:45.5161202Z scale_ub_tensor = None 2025-05-07T20:31:45.5161468Z 2025-05-07T20:31:45.5161717Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5162033Z op = silu_mul_quant 2025-05-07T20:31:45.5162430Z if compiled: 2025-05-07T20:31:45.5171378Z op = torch.compile(op) 2025-05-07T20:31:45.5171761Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5172054Z 2025-05-07T20:31:45.5172261Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.5172434Z 2025-05-07T20:31:45.5172549Z moe/activation_test.py:117: 2025-05-07T20:31:45.5172852Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5173200Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.5173499Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5174070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.5174650Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.5175331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.5176046Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.5176591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5177300Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5177983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5178629Z kernel = self.compile( 2025-05-07T20:31:45.5179194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5179873Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5180286Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5180516Z 2025-05-07T20:31:45.5180732Z self = 2025-05-07T20:31:45.5181847Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5183444Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50f247e20>} 2025-05-07T20:31:45.5184819Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5185922Z context = 2025-05-07T20:31:45.5186216Z 2025-05-07T20:31:45.5186388Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5186927Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5187418Z module_map=module_map) 2025-05-07T20:31:45.5187790Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5188162Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.5188444Z E ^ 2025-05-07T20:31:45.5188928Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5189388Z 2025-05-07T20:31:45.5189814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5190345Z 2025-05-07T20:31:45.8964868Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.8965776Z self=, 2025-05-07T20:31:45.8966214Z T=16384, 2025-05-07T20:31:45.8966429Z D=5120, 2025-05-07T20:31:45.8966636Z scale_ub=1200.0, 2025-05-07T20:31:45.8966881Z contiguous=False, 2025-05-07T20:31:45.8967128Z compiled=False, 2025-05-07T20:31:45.8967668Z ) 2025-05-07T20:31:45.8968016Z self = 2025-05-07T20:31:45.8968544Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:45.8968849Z 2025-05-07T20:31:45.8968942Z @given( 2025-05-07T20:31:45.8969182Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.8969513Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.8969843Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.8970184Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.8970530Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.8970831Z ) 2025-05-07T20:31:45.8971201Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.8971651Z def test_silu_mul_quant( 2025-05-07T20:31:45.8971903Z self, 2025-05-07T20:31:45.8972114Z T: int, 2025-05-07T20:31:45.8972316Z D: int, 2025-05-07T20:31:45.8972559Z scale_ub: Optional[float], 2025-05-07T20:31:45.8972848Z contiguous: bool, 2025-05-07T20:31:45.8973098Z compiled: bool, 2025-05-07T20:31:45.8973348Z ) -> None: 2025-05-07T20:31:45.8973580Z torch.manual_seed(2025) 2025-05-07T20:31:45.8973828Z 2025-05-07T20:31:45.8974117Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.8974477Z 2025-05-07T20:31:45.8974680Z x_sign = torch.sign(x) 2025-05-07T20:31:45.8974994Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.8975319Z x = x_sign * x_clamp 2025-05-07T20:31:45.8975565Z x0 = x[:, :D] 2025-05-07T20:31:45.8975800Z x1 = x[:, D:] 2025-05-07T20:31:45.8976022Z 2025-05-07T20:31:45.8976227Z if contiguous: 2025-05-07T20:31:45.8976472Z x0 = x0.contiguous() 2025-05-07T20:31:45.8976747Z x1 = x1.contiguous() 2025-05-07T20:31:45.8977002Z 2025-05-07T20:31:45.8977208Z if scale_ub is not None: 2025-05-07T20:31:45.8977498Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.8977852Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.8978412Z ) 2025-05-07T20:31:45.8978620Z else: 2025-05-07T20:31:45.8978852Z scale_ub_tensor = None 2025-05-07T20:31:45.8979108Z 2025-05-07T20:31:45.8979352Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.8979677Z op = silu_mul_quant 2025-05-07T20:31:45.8979929Z if compiled: 2025-05-07T20:31:45.8980184Z op = torch.compile(op) 2025-05-07T20:31:45.8980490Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.8980765Z 2025-05-07T20:31:45.8980967Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.8981141Z 2025-05-07T20:31:45.8981247Z moe/activation_test.py:117: 2025-05-07T20:31:45.8981551Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.8981890Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.8982182Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.8982893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:45.8983600Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.8984152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.8984855Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.8985589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.8986128Z kernel = self.compile( 2025-05-07T20:31:45.8986693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.8987454Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.8987857Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.8988092Z 2025-05-07T20:31:45.8988310Z self = 2025-05-07T20:31:45.8989418Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.8990842Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50ef417e0>} 2025-05-07T20:31:45.8992219Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.8993271Z context = 2025-05-07T20:31:45.8993571Z 2025-05-07T20:31:45.8993743Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.8994281Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.8994764Z module_map=module_map) 2025-05-07T20:31:45.8995137Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.8995537Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.8995821Z E ^ 2025-05-07T20:31:45.8996291Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.8996755Z 2025-05-07T20:31:45.8997179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.8997709Z 2025-05-07T20:31:45.8997816Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.8998250Z self=, 2025-05-07T20:31:45.8998656Z T=16384, 2025-05-07T20:31:45.8998942Z D=5120, 2025-05-07T20:31:45.8999145Z scale_ub=1200.0, 2025-05-07T20:31:45.8999370Z contiguous=True, 2025-05-07T20:31:45.8999601Z compiled=True, 2025-05-07T20:31:45.8999815Z ) 2025-05-07T20:31:45.9000140Z self = 2025-05-07T20:31:45.9000651Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:45.9000943Z 2025-05-07T20:31:45.9001024Z @given( 2025-05-07T20:31:45.9001263Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.9001577Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.9001892Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.9002235Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.9002573Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.9002866Z ) 2025-05-07T20:31:45.9003224Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.9003675Z def test_silu_mul_quant( 2025-05-07T20:31:45.9003923Z self, 2025-05-07T20:31:45.9004124Z T: int, 2025-05-07T20:31:45.9004324Z D: int, 2025-05-07T20:31:45.9004551Z scale_ub: Optional[float], 2025-05-07T20:31:45.9004829Z contiguous: bool, 2025-05-07T20:31:45.9005072Z compiled: bool, 2025-05-07T20:31:45.9005303Z ) -> None: 2025-05-07T20:31:45.9005526Z torch.manual_seed(2025) 2025-05-07T20:31:45.9005776Z 2025-05-07T20:31:45.9006056Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.9006406Z 2025-05-07T20:31:45.9006621Z x_sign = torch.sign(x) 2025-05-07T20:31:45.9006916Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.9007231Z x = x_sign * x_clamp 2025-05-07T20:31:45.9007562Z x0 = x[:, :D] 2025-05-07T20:31:45.9007783Z x1 = x[:, D:] 2025-05-07T20:31:45.9007998Z 2025-05-07T20:31:45.9008191Z if contiguous: 2025-05-07T20:31:45.9008431Z x0 = x0.contiguous() 2025-05-07T20:31:45.9008696Z x1 = x1.contiguous() 2025-05-07T20:31:45.9008942Z 2025-05-07T20:31:45.9009139Z if scale_ub is not None: 2025-05-07T20:31:45.9009421Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.9009767Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.9010080Z ) 2025-05-07T20:31:45.9010284Z else: 2025-05-07T20:31:45.9010504Z scale_ub_tensor = None 2025-05-07T20:31:45.9010771Z 2025-05-07T20:31:45.9011008Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.9011332Z op = silu_mul_quant 2025-05-07T20:31:45.9011593Z if compiled: 2025-05-07T20:31:45.9011843Z op = torch.compile(op) 2025-05-07T20:31:45.9012157Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.9012445Z 2025-05-07T20:31:45.9012642Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.9012820Z 2025-05-07T20:31:45.9012923Z moe/activation_test.py:117: 2025-05-07T20:31:45.9013225Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.9013555Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.9013847Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.9014421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.9014995Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.9015719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.9016425Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.9016980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.9017671Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.9018537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.9019084Z kernel = self.compile( 2025-05-07T20:31:45.9019644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.9020319Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.9020718Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.9020950Z 2025-05-07T20:31:45.9021166Z self = 2025-05-07T20:31:45.9022275Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.9023676Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50ef41090>} 2025-05-07T20:31:45.9025051Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.9026156Z context = 2025-05-07T20:31:45.9026458Z 2025-05-07T20:31:45.9026631Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.9027166Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.9027645Z module_map=module_map) 2025-05-07T20:31:45.9028136Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.9028508Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.9028771Z E ^ 2025-05-07T20:31:45.9029248Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.9029714Z 2025-05-07T20:31:45.9030142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.9030663Z 2025-05-07T20:31:46.0924105Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.0924809Z self=, 2025-05-07T20:31:46.0925355Z T=16384, 2025-05-07T20:31:46.0925567Z D=5120, 2025-05-07T20:31:46.0925766Z scale_ub=None, 2025-05-07T20:31:46.0925991Z contiguous=False, 2025-05-07T20:31:46.0926219Z compiled=True, 2025-05-07T20:31:46.0926435Z ) 2025-05-07T20:31:46.0926770Z self = 2025-05-07T20:31:46.0927313Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:46.0927609Z 2025-05-07T20:31:46.0927689Z @given( 2025-05-07T20:31:46.0927942Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.0928262Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.0928581Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.0928923Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.0929255Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.0929551Z ) 2025-05-07T20:31:46.0929914Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.0930369Z def test_silu_mul_quant( 2025-05-07T20:31:46.0930619Z self, 2025-05-07T20:31:46.0930820Z T: int, 2025-05-07T20:31:46.0931023Z D: int, 2025-05-07T20:31:46.0931242Z scale_ub: Optional[float], 2025-05-07T20:31:46.0931524Z contiguous: bool, 2025-05-07T20:31:46.0931775Z compiled: bool, 2025-05-07T20:31:46.0932005Z ) -> None: 2025-05-07T20:31:46.0932233Z torch.manual_seed(2025) 2025-05-07T20:31:46.0932843Z 2025-05-07T20:31:46.0933122Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.0933474Z 2025-05-07T20:31:46.0933681Z x_sign = torch.sign(x) 2025-05-07T20:31:46.0933977Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.0934295Z x = x_sign * x_clamp 2025-05-07T20:31:46.0934547Z x0 = x[:, :D] 2025-05-07T20:31:46.0934767Z x1 = x[:, D:] 2025-05-07T20:31:46.0934981Z 2025-05-07T20:31:46.0935185Z if contiguous: 2025-05-07T20:31:46.0935461Z x0 = x0.contiguous() 2025-05-07T20:31:46.0935736Z x1 = x1.contiguous() 2025-05-07T20:31:46.0935992Z 2025-05-07T20:31:46.0936194Z if scale_ub is not None: 2025-05-07T20:31:46.0936472Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.0936824Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.0937140Z ) 2025-05-07T20:31:46.0937335Z else: 2025-05-07T20:31:46.0937567Z scale_ub_tensor = None 2025-05-07T20:31:46.0937833Z 2025-05-07T20:31:46.0938172Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.0938496Z op = silu_mul_quant 2025-05-07T20:31:46.0938752Z if compiled: 2025-05-07T20:31:46.0939002Z op = torch.compile(op) 2025-05-07T20:31:46.0939310Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.0939596Z 2025-05-07T20:31:46.0939793Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.0939968Z 2025-05-07T20:31:46.0940071Z moe/activation_test.py:117: 2025-05-07T20:31:46.0940373Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.0940714Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.0941001Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.0941779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.0942377Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.0943050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.0943758Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.0944315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.0945017Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.0945753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.0946297Z kernel = self.compile( 2025-05-07T20:31:46.0946858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.0947537Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.0947944Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.0948188Z 2025-05-07T20:31:46.0948403Z self = 2025-05-07T20:31:46.0949515Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.0950951Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50ef42290>} 2025-05-07T20:31:46.0952331Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.0953390Z context = 2025-05-07T20:31:46.0953693Z 2025-05-07T20:31:46.0953956Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.0954496Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.0954976Z module_map=module_map) 2025-05-07T20:31:46.0955353Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.0956008Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.0956274Z E ^ 2025-05-07T20:31:46.0956754Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.0957219Z 2025-05-07T20:31:46.0957646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.0958173Z 2025-05-07T20:31:46.0958295Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.0958718Z self=, 2025-05-07T20:31:46.0959140Z T=2048, 2025-05-07T20:31:46.0959341Z D=5120, 2025-05-07T20:31:46.0959535Z scale_ub=None, 2025-05-07T20:31:46.0959761Z contiguous=False, 2025-05-07T20:31:46.0959995Z compiled=True, 2025-05-07T20:31:46.0960198Z ) 2025-05-07T20:31:46.1997651Z self = 2025-05-07T20:31:46.1998416Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:46.1998804Z 2025-05-07T20:31:46.1998924Z @given( 2025-05-07T20:31:46.1999247Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.1999692Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.2000012Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.2000360Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.2001029Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.2001330Z ) 2025-05-07T20:31:46.2001690Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.2002147Z def test_silu_mul_quant( 2025-05-07T20:31:46.2002396Z self, 2025-05-07T20:31:46.2002598Z T: int, 2025-05-07T20:31:46.2002795Z D: int, 2025-05-07T20:31:46.2003019Z scale_ub: Optional[float], 2025-05-07T20:31:46.2003303Z contiguous: bool, 2025-05-07T20:31:46.2003543Z compiled: bool, 2025-05-07T20:31:46.2003781Z ) -> None: 2025-05-07T20:31:46.2004007Z torch.manual_seed(2025) 2025-05-07T20:31:46.2004248Z 2025-05-07T20:31:46.2004531Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.2004878Z 2025-05-07T20:31:46.2005072Z x_sign = torch.sign(x) 2025-05-07T20:31:46.2005370Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.2005694Z x = x_sign * x_clamp 2025-05-07T20:31:46.2005940Z x0 = x[:, :D] 2025-05-07T20:31:46.2006162Z x1 = x[:, D:] 2025-05-07T20:31:46.2006385Z 2025-05-07T20:31:46.2006579Z if contiguous: 2025-05-07T20:31:46.2006851Z x0 = x0.contiguous() 2025-05-07T20:31:46.2007113Z x1 = x1.contiguous() 2025-05-07T20:31:46.2007363Z 2025-05-07T20:31:46.2007564Z if scale_ub is not None: 2025-05-07T20:31:46.2007849Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.2008188Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.2008505Z ) 2025-05-07T20:31:46.2008708Z else: 2025-05-07T20:31:46.2008921Z scale_ub_tensor = None 2025-05-07T20:31:46.2009181Z 2025-05-07T20:31:46.2009422Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.2009744Z op = silu_mul_quant 2025-05-07T20:31:46.2010005Z if compiled: 2025-05-07T20:31:46.2010269Z op = torch.compile(op) 2025-05-07T20:31:46.2010570Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.2010853Z 2025-05-07T20:31:46.2011219Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.2011387Z 2025-05-07T20:31:46.2011490Z moe/activation_test.py:117: 2025-05-07T20:31:46.2011798Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.2012143Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.2012436Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.2013007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.2013585Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.2014265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.2014967Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.2015577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.2016275Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.2016964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.2017505Z kernel = self.compile( 2025-05-07T20:31:46.2018172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.2018851Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.2019259Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.2019487Z 2025-05-07T20:31:46.2019700Z self = 2025-05-07T20:31:46.2020895Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.2022318Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50ef42170>} 2025-05-07T20:31:46.2023706Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.2024751Z context = 2025-05-07T20:31:46.2025052Z 2025-05-07T20:31:46.2025224Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.2025810Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.2026293Z module_map=module_map) 2025-05-07T20:31:46.2026671Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.2027037Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.2027306Z E ^ 2025-05-07T20:31:46.2027783Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.2028252Z 2025-05-07T20:31:46.2028680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.2029210Z 2025-05-07T20:31:46.2029318Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.2029750Z self=, 2025-05-07T20:31:46.2030157Z T=2048, 2025-05-07T20:31:46.2030363Z D=5120, 2025-05-07T20:31:46.2030571Z scale_ub=1200.0, 2025-05-07T20:31:46.2030802Z contiguous=False, 2025-05-07T20:31:46.2031037Z compiled=True, 2025-05-07T20:31:46.2040127Z ) 2025-05-07T20:31:46.2040519Z self = 2025-05-07T20:31:46.2041043Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:46.2041443Z 2025-05-07T20:31:46.2041524Z @given( 2025-05-07T20:31:46.2041767Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.2042089Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.2042400Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.2042742Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.2043082Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.2043369Z ) 2025-05-07T20:31:46.2043731Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.2044187Z def test_silu_mul_quant( 2025-05-07T20:31:46.2044442Z self, 2025-05-07T20:31:46.2044636Z T: int, 2025-05-07T20:31:46.2044836Z D: int, 2025-05-07T20:31:46.2045070Z scale_ub: Optional[float], 2025-05-07T20:31:46.2045370Z contiguous: bool, 2025-05-07T20:31:46.2045638Z compiled: bool, 2025-05-07T20:31:46.2045868Z ) -> None: 2025-05-07T20:31:46.2046092Z torch.manual_seed(2025) 2025-05-07T20:31:46.2046340Z 2025-05-07T20:31:46.2046620Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.2046962Z 2025-05-07T20:31:46.2047170Z x_sign = torch.sign(x) 2025-05-07T20:31:46.2047468Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.2047777Z x = x_sign * x_clamp 2025-05-07T20:31:46.2048024Z x0 = x[:, :D] 2025-05-07T20:31:46.2048250Z x1 = x[:, D:] 2025-05-07T20:31:46.2048457Z 2025-05-07T20:31:46.2048652Z if contiguous: 2025-05-07T20:31:46.2048882Z x0 = x0.contiguous() 2025-05-07T20:31:46.2049141Z x1 = x1.contiguous() 2025-05-07T20:31:46.2049395Z 2025-05-07T20:31:46.2049596Z if scale_ub is not None: 2025-05-07T20:31:46.2049962Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.2050306Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.2050625Z ) 2025-05-07T20:31:46.2050828Z else: 2025-05-07T20:31:46.2051037Z scale_ub_tensor = None 2025-05-07T20:31:46.2051301Z 2025-05-07T20:31:46.2051542Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.2051861Z op = silu_mul_quant 2025-05-07T20:31:46.2052117Z if compiled: 2025-05-07T20:31:46.2052375Z op = torch.compile(op) 2025-05-07T20:31:46.2052673Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.2052956Z 2025-05-07T20:31:46.2053157Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.2053323Z 2025-05-07T20:31:46.2053427Z moe/activation_test.py:117: 2025-05-07T20:31:46.2053732Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.2054065Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.2054358Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.2054924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.2055827Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.2056621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.2057319Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.2057864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.2058745Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.2059547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.2060179Z kernel = self.compile( 2025-05-07T20:31:46.2060826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.2061618Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.2062170Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.2062404Z 2025-05-07T20:31:46.2062614Z self = 2025-05-07T20:31:46.2063710Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.2065109Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50ef43880>} 2025-05-07T20:31:46.2066475Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.2067508Z context = 2025-05-07T20:31:46.2067809Z 2025-05-07T20:31:46.2067980Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.2068511Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.2068990Z module_map=module_map) 2025-05-07T20:31:46.2069355Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.2069715Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.2069981Z E ^ 2025-05-07T20:31:46.2070451Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
    (... Triton compile frames identical to the traceback above ...)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
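For readers without the FBGEMM source at hand: the op under test fuses a SiLU-gated multiply with dynamic FP8 quantization. A plain-PyTorch sketch of the presumed semantics follows; the rowwise scaling details are assumptions for illustration, not FBGEMM's kernel, and it needs a PyTorch build with float8 dtypes.

from typing import Optional, Tuple

import torch
import torch.nn.functional as F

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU-gated multiply, computed in fp32 for accuracy.
    y = F.silu(x0.float()) * x1.float()
    # Dynamic rowwise scale, optionally capped by scale_ub.
    amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    scale = amax / fp8_max
    # Quantize to FP8 E4M3 (the dtype Triton calls fp8e4nv).
    y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, scale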
Hypothesis then tries further examples; each fails at moe/activation_test.py:117 with the identical CompilationError from _fbgemm_silu_mul_quant:

Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
  -> triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture")
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture")
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture")
Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
  -> triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture")
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture")
Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture")
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture")
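The error message itself names the two FP8 formats this architecture does accept: fp8e4b15 and fp8e5 (E5M2). If a fallback on pre-SM-8.9 GPUs were wanted, the dtype could be keyed on device capability as sketched below; whether the FBGEMM kernel and its consumers tolerate E5M2's smaller mantissa is a separate question this log does not answer.

import torch

def pick_fp8_dtype() -> torch.dtype:
    # float8_e4m3fn maps to Triton's fp8e4nv (rejected above on SM 8.6);
    # float8_e5m2 maps to fp8e5, which the error message lists as supported.
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2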
Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Subsequent examples fail against the same exhausted allocator:

Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError (tried to allocate 112.00 MiB; 32.44 MiB free) at moe/activation_test.py:95
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError (tried to allocate 448.00 MiB; 144.44 MiB free) at moe/activation_test.py:92
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError (tried to allocate 56.00 MiB; 32.44 MiB free) at moe/activation_test.py:95
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError (tried to allocate 56.00 MiB; 32.44 MiB free) at moe/activation_test.py:94
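The OutOfMemoryError reports all show the same picture: roughly 21.9 to 22.0 GiB of the A10G's 22.07 GiB already in use, with 40 to 136 MiB reserved but unallocated, so even 56 MiB requests fail. Below is a sketch applying the allocator's own suggestion plus a cache release between Hypothesis examples; where to hook this (for instance a conftest.py fixture or the CI job environment) is an assumption, not something this workflow does.

import os

# Must be set before the first CUDA allocation in the process.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

def release_cached_cuda_memory() -> None:
    # Returns PyTorch's cached-but-unallocated blocks to the driver so the
    # next example starts from a less fragmented pool.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
        torch.cuda.empty_cache()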
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:47.3945276Z 2025-05-07T20:31:47.3945413Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:47.3945636Z 2025-05-07T20:31:47.3945777Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:47.3946232Z self=, 2025-05-07T20:31:47.3946650Z T=1, 2025-05-07T20:31:47.3946849Z D=7168, 2025-05-07T20:31:47.3947052Z scale_ub=1200.0, 2025-05-07T20:31:47.3947290Z contiguous=True, 2025-05-07T20:31:47.3947528Z compiled=False, 2025-05-07T20:31:47.3947763Z ) 2025-05-07T20:31:47.3948099Z self = 2025-05-07T20:31:47.3948606Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:47.3948878Z 2025-05-07T20:31:47.3948962Z @given( 2025-05-07T20:31:47.3949213Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:47.3949543Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:47.3949861Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:47.3950366Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:47.3950714Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:47.3951010Z ) 2025-05-07T20:31:47.3951379Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:47.3951842Z def test_silu_mul_quant( 2025-05-07T20:31:47.3952101Z self, 2025-05-07T20:31:47.3952307Z T: int, 2025-05-07T20:31:47.3952522Z D: int, 2025-05-07T20:31:47.3952757Z scale_ub: Optional[float], 2025-05-07T20:31:47.3953038Z contiguous: bool, 2025-05-07T20:31:47.3953294Z compiled: bool, 2025-05-07T20:31:47.3953540Z ) -> None: 2025-05-07T20:31:47.3953768Z torch.manual_seed(2025) 2025-05-07T20:31:47.3954024Z 2025-05-07T20:31:47.3954319Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:47.3954668Z 2025-05-07T20:31:47.3954878Z x_sign = torch.sign(x) 2025-05-07T20:31:47.3955191Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:47.3955511Z x = x_sign * x_clamp 2025-05-07T20:31:47.3956076Z x0 = x[:, :D] 2025-05-07T20:31:47.3956310Z x1 = x[:, D:] 2025-05-07T20:31:47.3956530Z 2025-05-07T20:31:47.3956736Z if contiguous: 2025-05-07T20:31:47.3956997Z x0 = x0.contiguous() 2025-05-07T20:31:47.3957275Z x1 = x1.contiguous() 2025-05-07T20:31:47.3957539Z 2025-05-07T20:31:47.3957749Z if scale_ub is not None: 2025-05-07T20:31:47.3958041Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:47.3958388Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:47.3958716Z ) 2025-05-07T20:31:47.3958926Z else: 2025-05-07T20:31:47.3959304Z scale_ub_tensor = None 2025-05-07T20:31:47.3959574Z 2025-05-07T20:31:47.3959827Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:47.3960152Z op = silu_mul_quant 2025-05-07T20:31:47.3960423Z if compiled: 2025-05-07T20:31:47.3960687Z op = torch.compile(op) 2025-05-07T20:31:47.3961001Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:47.3961293Z 2025-05-07T20:31:47.3961504Z > y_fp8, y_scale = fn() 2025-05-07T20:31:47.3961676Z 2025-05-07T20:31:47.3961784Z moe/activation_test.py:117: 2025-05-07T20:31:47.3962097Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:47.3962441Z moe/activation_test.py:115: in fn 2025-05-07T20:31:47.3962744Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:47.3963457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:47.3964184Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:47.3964745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:47.3965449Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:47.3966186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:47.3966742Z kernel = self.compile( 2025-05-07T20:31:47.3967308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:47.3967983Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:47.3968398Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:47.3968630Z 2025-05-07T20:31:47.3968851Z self = 2025-05-07T20:31:47.3969966Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:47.3971495Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50ec83b50>} 2025-05-07T20:31:47.3972869Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:47.3973919Z context = 2025-05-07T20:31:47.3974217Z 2025-05-07T20:31:47.3974402Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:47.3974938Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:47.3975433Z module_map=module_map) 2025-05-07T20:31:47.3975847Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:47.3976251Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:47.3976524Z E ^ 2025-05-07T20:31:47.3977007Z E ValueError("type fp8e4nv not supported in this architecture. 
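Note on the CompilationError: fp8e4nv is Triton's name for the float8 E4M3 format, which needs native FP8 hardware to lower; on this device Triton only accepts 'fp8e4b15' and 'fp8e5', exactly as the message lists. The 22.07 GiB capacity in the surrounding OOM messages is consistent with an NVIDIA A10G (compute capability 8.6), which predates the Ada/Hopper generation where E4M3 support arrives. A minimal sketch of a capability gate; the 8.9 cutoff is an assumption inferred from the error, not taken from the FBGEMM sources:

    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv (float8 E4M3) lowering needs an
        # Ada/Hopper-class GPU, i.e. compute capability >= 8.9.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)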
2025-05-07T20:31:47.3978634Z Trying example: test_silu_mul_quant(self=<…>, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> triton.compiler.errors.CompilationError in _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:31:47.4792625Z Trying example: test_silu_mul_quant(self=<…>, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> triton.compiler.errors.CompilationError in _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:31:47.4833620Z Trying example: test_silu_mul_quant(self=<…>, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 56.00 MiB (30.44 MiB free; 21.70 GiB allocated by PyTorch; 53.93 MiB reserved but unallocated)
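Note on the OutOfMemoryError pattern: the allocator message carries its own mitigation. With roughly 21.7 GiB already held by PyTorch and only about 30 MiB free, even a 56 MiB request fails, and the nontrivial "reserved but unallocated" figures point at fragmentation. A minimal sketch of opting into expandable segments, assuming the variable is set before the first CUDA allocation (in CI it could equally be exported in the job environment):

    import os

    # Must be set before torch initializes its CUDA caching allocator.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported afterwards so the allocator sees the setting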
2025-05-07T20:31:47.5794159Z Trying example: test_silu_mul_quant(self=<…>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> triton.compiler.errors.CompilationError in _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:31:47.5825811Z Trying example: test_silu_mul_quant(self=<…>, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 40.00 MiB (30.44 MiB free; 21.73 GiB allocated by PyTorch; 13.87 MiB reserved but unallocated)

2025-05-07T20:31:47.5838957Z Trying example: test_silu_mul_quant(self=<…>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 320.00 MiB (30.44 MiB free; 21.73 GiB allocated by PyTorch; 13.87 MiB reserved but unallocated)
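The requested sizes track the test's input tensor exactly: x has shape [T, 2 * D] in bfloat16, 2 bytes per element, so the torch.randn call asks for T * 2D * 2 bytes. A worked check against the failures above:

    # bfloat16 is 2 bytes per element; x is [T, 2 * D]
    T, D = 16384, 5120
    print(T * (2 * D) * 2 / 2**20)  # 320.0 -> matches "Tried to allocate 320.00 MiB"

    T, D = 2048, 7168
    print(T * (2 * D) * 2 / 2**20)  # 56.0 -> matches the 56.00 MiB failures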
2025-05-07T20:31:47.6826653Z Trying example: test_silu_mul_quant(self=<…>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 80.00 MiB (30.44 MiB free; 21.73 GiB allocated by PyTorch; 13.87 MiB reserved but unallocated)

2025-05-07T20:31:47.6841180Z Trying example: test_silu_mul_quant(self=<…>, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 40.00 MiB (30.44 MiB free; 21.73 GiB allocated by PyTorch; 13.87 MiB reserved but unallocated)

2025-05-07T20:31:47.6856047Z Trying example: test_silu_mul_quant(self=<…>, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 112.00 MiB (30.44 MiB free; 21.73 GiB allocated by PyTorch; 13.87 MiB reserved but unallocated)

2025-05-07T20:31:47.6869257Z Trying example: test_silu_mul_quant(self=<…>, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 40.00 MiB (30.44 MiB free; 21.73 GiB allocated by PyTorch; 13.87 MiB reserved but unallocated)

2025-05-07T20:31:47.6882143Z Trying example: test_silu_mul_quant(self=<…>, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 112.00 MiB (30.44 MiB free; 21.73 GiB allocated by PyTorch; 13.87 MiB reserved but unallocated)

2025-05-07T20:31:47.8182359Z Trying example: test_silu_mul_quant(self=<…>, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 448.00 MiB (30.44 MiB free; 21.73 GiB allocated by PyTorch; 13.87 MiB reserved but unallocated)

2025-05-07T20:31:47.8195100Z Trying example: test_silu_mul_quant(self=<…>, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 112.00 MiB (30.44 MiB free; 21.73 GiB allocated by PyTorch; 13.87 MiB reserved but unallocated)

2025-05-07T20:31:47.8207755Z Trying example: test_silu_mul_quant(self=<…>, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 448.00 MiB (30.44 MiB free; 21.73 GiB allocated by PyTorch; 13.87 MiB reserved but unallocated)

2025-05-07T20:31:47.8220374Z Trying example: test_silu_mul_quant(self=<…>, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 448.00 MiB (30.44 MiB free; 21.73 GiB allocated by PyTorch; 13.87 MiB reserved but unallocated)
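Across these examples the "allocated by PyTorch" figure only ratchets upward (21.67 GiB, then 21.70, then 21.73 GiB), so memory held by earlier examples, or by whatever ran before this test, is never returned, and progressively smaller requests start failing. A sketch of a per-example cleanup hook; the function name is illustrative, not something from the test suite:

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Drop dead Python references first, then hand cached blocks
        # back to the driver so the next example starts clean.
        gc.collect()
        torch.cuda.empty_cache()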
2025-05-07T20:31:47.8232901Z Trying example: test_silu_mul_quant(self=<…>, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
  -> triton.compiler.errors.CompilationError in _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:31:48.1810819Z Trying example: test_silu_mul_quant(self=<…>, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 56.00 MiB (30.44 MiB free; 21.74 GiB allocated by PyTorch; 5.24 MiB reserved but unallocated)
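Since every example that survives allocation dies in the same fp8e4nv CompilationError, the test cannot pass on this GPU at all, and letting Hypothesis walk the whole example grid into the same two failures mostly burns runner time. A common pattern is to skip at collection time based on device capability. A sketch, with the class name and capability threshold as illustrative assumptions:

    import unittest

    import torch

    def has_fp8e4nv_support() -> bool:
        # Assumption, as above: E4M3 needs compute capability >= 8.9.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not has_fp8e4nv_support(), "Triton fp8e4nv unsupported on this GPU")
    class SiluMulQuantTests(unittest.TestCase):  # hypothetical name
        ...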
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.1823351Z 2025-05-07T20:31:48.1823477Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.1823708Z 2025-05-07T20:31:48.1823818Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.1824249Z self=, 2025-05-07T20:31:48.1824744Z T=128, 2025-05-07T20:31:48.1824940Z D=7168, 2025-05-07T20:31:48.1825136Z scale_ub=1200.0, 2025-05-07T20:31:48.1825360Z contiguous=True, 2025-05-07T20:31:48.1825594Z compiled=True, 2025-05-07T20:31:48.1825828Z ) 2025-05-07T20:31:48.2252770Z self = 2025-05-07T20:31:48.2253320Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:48.2253593Z 2025-05-07T20:31:48.2253681Z @given( 2025-05-07T20:31:48.2253917Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.2254242Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.2254670Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.2255048Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.2255379Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.2256050Z ) 2025-05-07T20:31:48.2256429Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.2256877Z def test_silu_mul_quant( 2025-05-07T20:31:48.2257124Z self, 2025-05-07T20:31:48.2257324Z T: int, 2025-05-07T20:31:48.2257520Z D: int, 2025-05-07T20:31:48.2257747Z scale_ub: Optional[float], 2025-05-07T20:31:48.2258113Z contiguous: bool, 2025-05-07T20:31:48.2258359Z compiled: bool, 2025-05-07T20:31:48.2258594Z ) -> None: 2025-05-07T20:31:48.2258824Z torch.manual_seed(2025) 2025-05-07T20:31:48.2259083Z 2025-05-07T20:31:48.2259382Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.2259774Z 2025-05-07T20:31:48.2259975Z x_sign = torch.sign(x) 2025-05-07T20:31:48.2260527Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.2260850Z x = x_sign * x_clamp 2025-05-07T20:31:48.2261101Z x0 = x[:, :D] 2025-05-07T20:31:48.2261316Z x1 = x[:, D:] 2025-05-07T20:31:48.2261536Z 2025-05-07T20:31:48.2261735Z if contiguous: 2025-05-07T20:31:48.2261968Z x0 = x0.contiguous() 2025-05-07T20:31:48.2262241Z x1 = x1.contiguous() 2025-05-07T20:31:48.2262491Z 2025-05-07T20:31:48.2262685Z if scale_ub is not None: 2025-05-07T20:31:48.2262966Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.2263313Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.2263625Z ) 2025-05-07T20:31:48.2263826Z else: 2025-05-07T20:31:48.2264047Z scale_ub_tensor = None 2025-05-07T20:31:48.2264299Z 2025-05-07T20:31:48.2264543Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.2264866Z op = silu_mul_quant 2025-05-07T20:31:48.2265124Z if compiled: 2025-05-07T20:31:48.2265379Z op = torch.compile(op) 2025-05-07T20:31:48.2265685Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.2265973Z 2025-05-07T20:31:48.2266167Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.2266342Z 2025-05-07T20:31:48.2266449Z moe/activation_test.py:117: 2025-05-07T20:31:48.2266749Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.2267076Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.2267368Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.2267943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:48.2268514Z return fn(*args, **kwargs) 
2025-05-07T20:31:48.2269190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.2269929Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.2270606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.2271306Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.2272147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.2272692Z kernel = self.compile( 2025-05-07T20:31:48.2273246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.2273919Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.2274315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.2274552Z 2025-05-07T20:31:48.2274764Z self = 2025-05-07T20:31:48.2275871Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.2277297Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50e6cf0a0>} 2025-05-07T20:31:48.2278664Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.2279713Z context = 2025-05-07T20:31:48.2280010Z 2025-05-07T20:31:48.2280180Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.2280715Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.2281276Z module_map=module_map) 2025-05-07T20:31:48.2281654Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.2282020Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.2282288Z E ^ 2025-05-07T20:31:48.2282767Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.2283233Z 2025-05-07T20:31:48.2283657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.2284178Z 2025-05-07T20:31:48.2284291Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.2284709Z self=, 2025-05-07T20:31:48.2285118Z T=128, 2025-05-07T20:31:48.2285313Z D=7168, 2025-05-07T20:31:48.2285509Z scale_ub=1200.0, 2025-05-07T20:31:48.2285739Z contiguous=True, 2025-05-07T20:31:48.2285971Z compiled=False, 2025-05-07T20:31:48.2286183Z ) 2025-05-07T20:31:48.2286520Z self = 2025-05-07T20:31:48.2287021Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:48.2287300Z 2025-05-07T20:31:48.2287387Z @given( 2025-05-07T20:31:48.2287617Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.2287935Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.2288247Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.2288577Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.2288913Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.2289202Z ) 2025-05-07T20:31:48.2289556Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.2290004Z def test_silu_mul_quant( 2025-05-07T20:31:48.2290251Z self, 2025-05-07T20:31:48.2290452Z T: int, 2025-05-07T20:31:48.2290647Z D: int, 2025-05-07T20:31:48.2290875Z scale_ub: Optional[float], 2025-05-07T20:31:48.2291152Z contiguous: bool, 2025-05-07T20:31:48.2291393Z compiled: bool, 2025-05-07T20:31:48.2291747Z ) -> None: 2025-05-07T20:31:48.2291968Z torch.manual_seed(2025) 2025-05-07T20:31:48.2292209Z 2025-05-07T20:31:48.2292490Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.2292837Z 2025-05-07T20:31:48.2293030Z x_sign = torch.sign(x) 2025-05-07T20:31:48.2293333Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.2295406Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
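The recurring CompilationError above is architectural rather than input-dependent: Triton's fp8e4nv (FP8 E4M3) type is, to the best of current knowledge, only lowered for NVIDIA compute capability 8.9 and newer, and the GPU on this runner only exposes fp8e4b15 and fp8e5. A hedged sketch of a skip guard such a test could use; the helper and class names are illustrative, not FBGEMM's API:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Hypothetical helper: assume Triton lowers fp8e4nv on CC >= 8.9
        # (Ada/Hopper class parts); older GPUs raise the ValueError in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(
        _supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9"
    )
    class Fp8ActivationTests(unittest.TestCase):  # illustrative name
        ...

The tuple comparison against (8, 9) works because torch.cuda.get_device_capability() returns a (major, minor) pair.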
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.2297320Z 2025-05-07T20:31:48.2297443Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:48.2297658Z 2025-05-07T20:31:48.2297772Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.2298335Z self=, 2025-05-07T20:31:48.2298751Z T=128, 2025-05-07T20:31:48.2298948Z D=5120, 2025-05-07T20:31:48.2299142Z scale_ub=1200.0, 2025-05-07T20:31:48.2299373Z contiguous=True, 2025-05-07T20:31:48.2299623Z compiled=True, 2025-05-07T20:31:48.2299831Z ) 2025-05-07T20:31:48.2300162Z self = 2025-05-07T20:31:48.2300662Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:48.2300932Z 2025-05-07T20:31:48.2301019Z @given( 2025-05-07T20:31:48.2301337Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.2301660Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.2311477Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.2311836Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.2312178Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.2312472Z ) 2025-05-07T20:31:48.2312840Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.2313287Z def test_silu_mul_quant( 2025-05-07T20:31:48.2313538Z self, 2025-05-07T20:31:48.2313741Z T: int, 2025-05-07T20:31:48.2313938Z D: int, 2025-05-07T20:31:48.2314167Z scale_ub: Optional[float], 2025-05-07T20:31:48.2314450Z contiguous: bool, 2025-05-07T20:31:48.2314692Z compiled: bool, 2025-05-07T20:31:48.2314931Z ) -> None: 2025-05-07T20:31:48.2315158Z torch.manual_seed(2025) 2025-05-07T20:31:48.2315401Z 2025-05-07T20:31:48.2315702Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.2316101Z 2025-05-07T20:31:48.2316297Z > x_sign = torch.sign(x) 2025-05-07T20:31:48.2318307Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
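For orientation, the test checks a rowwise-scaled FP8 scheme: the listings in this log dequantize with y = y_fp8.to(torch.float32) * y_scale[:, None]. A conceptual pure-PyTorch sketch of a quantizer consistent with that convention follows; it is an illustration only, not FBGEMM's triton_quantize_fp8_row, and the scale_ub semantics are an assumption:

    import torch

    FP8_E4M3_MAX = 448.0  # finite max of torch.float8_e4m3fn

    def rowwise_quantize_fp8_sketch(y, scale_ub=None):
        # One scale per row, chosen so y_fp8 * scale[:, None] reconstructs y.
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            # Assumed semantics: scale_ub caps the per-row max (hypothetical).
            row_max = torch.minimum(row_max, scale_ub)
        scale = (row_max / FP8_E4M3_MAX).to(torch.float32)
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

On this machine FBGEMM's Triton kernels fail at compile time (make_ir) before any such math runs, which is the ValueError the log keeps hitting.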
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.2320207Z 2025-05-07T20:31:48.2320330Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:48.2320557Z 2025-05-07T20:31:48.2320664Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.2321097Z self=, 2025-05-07T20:31:48.2321506Z T=128, 2025-05-07T20:31:48.2321702Z D=7168, 2025-05-07T20:31:48.2322046Z scale_ub=None, 2025-05-07T20:31:48.2322266Z contiguous=True, 2025-05-07T20:31:48.2322504Z compiled=True, 2025-05-07T20:31:48.2322719Z ) 2025-05-07T20:31:48.5244324Z self = 2025-05-07T20:31:48.5244942Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:48.5245218Z 2025-05-07T20:31:48.5245305Z @given( 2025-05-07T20:31:48.5245564Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5245901Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5246232Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5246803Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5247494Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5248123Z ) 2025-05-07T20:31:48.5248847Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5249759Z def test_silu_mul_quant( 2025-05-07T20:31:48.5250279Z self, 2025-05-07T20:31:48.5250685Z T: int, 2025-05-07T20:31:48.5251094Z D: int, 2025-05-07T20:31:48.5251551Z scale_ub: Optional[float], 2025-05-07T20:31:48.5252110Z contiguous: bool, 2025-05-07T20:31:48.5252612Z compiled: bool, 2025-05-07T20:31:48.5253091Z ) -> None: 2025-05-07T20:31:48.5253539Z torch.manual_seed(2025) 2025-05-07T20:31:48.5254048Z 2025-05-07T20:31:48.5254618Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5257655Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
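Note the trend across examples: the first OOM still reported 30.44 MiB free, while these later ones are down to 8.44 MiB, with roughly 21.77 GiB pinned as "allocated by PyTorch". That pattern suggests tensors from earlier Hypothesis examples are still live when the next example starts. A sketch of reclaiming memory per example, assuming the test body can afford it:

    import gc

    import torch

    # Sketch: Hypothesis runs many examples inside a single test-method call,
    # so unittest's tearDown does not run between examples; reclaim memory at
    # the top of the test body instead.
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
        gc.collect()            # drop dangling references from prior examples
        torch.cuda.empty_cache()  # return cached blocks to the driver
        torch.manual_seed(2025)
        ...  # rest of the body unchanged

This frees cached blocks, not leaked references; if something in the kernel path retains the inputs, the numbers above would keep climbing regardless.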
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.5259673Z 2025-05-07T20:31:48.5259809Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.5260031Z 2025-05-07T20:31:48.5312118Z FAILED 2025-05-07T20:31:48.5312366Z 2025-05-07T20:31:48.5312666Z =================================== FAILURES =================================== 2025-05-07T20:31:48.5313130Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:31:48.5313577Z + Exception Group Traceback (most recent call last): 2025-05-07T20:31:48.5314243Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:31:48.5314909Z | yield 2025-05-07T20:31:48.5315498Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 591, in run 2025-05-07T20:31:48.5316208Z | self._callTestMethod(testMethod) 2025-05-07T20:31:48.5316992Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 549, in _callTestMethod 2025-05-07T20:31:48.5317771Z | method() 2025-05-07T20:31:48.5318694Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:31:48.5319727Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5320644Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:31:48.5321580Z | raise the_error_hypothesis_found 2025-05-07T20:31:48.5322277Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:31:48.5322962Z +-+---------------- 1 ---------------- 2025-05-07T20:31:48.5323381Z | Traceback (most recent call last): 2025-05-07T20:31:48.5324387Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:48.5325702Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5328415Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
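Hypothesis folds the four failure classes into one exceptiongroup.ExceptionGroup (the PEP 654 backport, since this environment is Python 3.10). A small sketch of triaging such a group programmatically, assuming the backport's split() API; the function name is illustrative:

    import torch
    from exceptiongroup import ExceptionGroup
    from triton.compiler.errors import CompilationError

    def split_failures(eg: ExceptionGroup):
        # split() returns (matching, non-matching); either side may be None.
        ooms, rest = eg.split(torch.OutOfMemoryError)
        compile_errors, other = rest.split(CompilationError) if rest else (None, None)
        return ooms, compile_errors, other

Applied to this run it would separate the three CUDA OOMs (sub-exceptions 1 through 3) from the fp8e4nv CompilationError (sub-exception 4).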
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.5330511Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:48.5331086Z | self=, 2025-05-07T20:31:48.5331520Z | T=128, 2025-05-07T20:31:48.5331737Z | D=7168, 2025-05-07T20:31:48.5331963Z | scale_ub=1200.0, 2025-05-07T20:31:48.5332219Z | contiguous=True, 2025-05-07T20:31:48.5332518Z | compiled=False, 2025-05-07T20:31:48.5332815Z | ) 2025-05-07T20:31:48.5332993Z | 2025-05-07T20:31:48.5333544Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case 2025-05-07T20:31:48.5334173Z +---------------- 2 ---------------- 2025-05-07T20:31:48.5334503Z | Traceback (most recent call last): 2025-05-07T20:31:48.5335509Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:48.5336755Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5339051Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.5341089Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:48.5341533Z | self=, 2025-05-07T20:31:48.5341945Z | T=128, 2025-05-07T20:31:48.5342155Z | D=7168, 2025-05-07T20:31:48.5342364Z | scale_ub=None, 2025-05-07T20:31:48.5342616Z | contiguous=True, 2025-05-07T20:31:48.5342871Z | compiled=True, 2025-05-07T20:31:48.5343092Z | ) 2025-05-07T20:31:48.5343283Z | 2025-05-07T20:31:48.5343816Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:31:48.5344439Z +---------------- 3 ---------------- 2025-05-07T20:31:48.5344732Z | Traceback (most recent call last): 2025-05-07T20:31:48.5345463Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:48.5346282Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5348411Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
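The note above gives the exact replay recipe. A sketch of the decorator applied to this test, mirroring the @given stack from the log; the version string must match the installed Hypothesis 6.131.14, and max_examples is omitted here because _MAX_SAMPLES is defined elsewhere in the test module:

    from hypothesis import Verbosity, given, reproduce_failure, settings
    import hypothesis.strategies as st

    @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=')  # blob copied from the log
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, deadline=None)
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
        ...  # body unchanged

Hypothesis raises an error if the decorator is left in after the underlying failure is fixed, so it is meant to be temporary, exactly as the log's wording says.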
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.5350555Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:48.5351005Z | self=, 2025-05-07T20:31:48.5351420Z | T=128, 2025-05-07T20:31:48.5351627Z | D=5120, 2025-05-07T20:31:48.5351836Z | scale_ub=1200.0, 2025-05-07T20:31:48.5352084Z | contiguous=True, 2025-05-07T20:31:48.5352328Z | compiled=True, 2025-05-07T20:31:48.5352550Z | ) 2025-05-07T20:31:48.5352739Z | 2025-05-07T20:31:48.5353273Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:31:48.5353939Z +---------------- 4 ---------------- 2025-05-07T20:31:48.5354251Z | Traceback (most recent call last): 2025-05-07T20:31:48.5355198Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:31:48.5356422Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:48.5357367Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:31:48.5358368Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5359588Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:31:48.5360734Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:48.5361743Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:31:48.5362797Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.5363852Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:31:48.5364953Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5366092Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:31:48.5367290Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5368396Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:31:48.5369385Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:48.5370299Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:31:48.5371106Z | fn() 2025-05-07T20:31:48.5371909Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:31:48.5372794Z | self.fn.run( 2025-05-07T20:31:48.5373531Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:31:48.5374364Z | kernel = self.compile( 2025-05-07T20:31:48.5375238Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:31:48.5376225Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.5377108Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:48.5377914Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.5378705Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.5379062Z | def _kernel_quantize_fp8_row( 2025-05-07T20:31:48.5379330Z | ^ 2025-05-07T20:31:48.5379806Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.5380386Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:48.5380795Z | # The test always failed when commented parts were varied together. 2025-05-07T20:31:48.5381319Z | self=, 2025-05-07T20:31:48.5381762Z | T=1, # or any other generated value 2025-05-07T20:31:48.5382078Z | D=5120, # or any other generated value 2025-05-07T20:31:48.5382427Z | scale_ub=None, # or any other generated value 2025-05-07T20:31:48.5382802Z | contiguous=True, # or any other generated value 2025-05-07T20:31:48.5383182Z | compiled=True, # or any other generated value 2025-05-07T20:31:48.5383485Z | ) 2025-05-07T20:31:48.5383673Z | 2025-05-07T20:31:48.5384210Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:31:48.5384820Z +------------------------------------ 2025-05-07T20:31:48.5385190Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:31:48.5385571Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5385996Z self=, 2025-05-07T20:31:48.5386562Z T=1, 2025-05-07T20:31:48.5386837Z D=5120, 2025-05-07T20:31:48.5387124Z scale_ub=None, 2025-05-07T20:31:48.5387572Z contiguous=True, 2025-05-07T20:31:48.5387894Z compiled=True, 2025-05-07T20:31:48.5388195Z ) 2025-05-07T20:31:48.5388651Z self = 2025-05-07T20:31:48.5389369Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:48.5389747Z 2025-05-07T20:31:48.5389868Z @given( 2025-05-07T20:31:48.5390204Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5390667Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5391120Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5395657Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5396150Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5396629Z ) 2025-05-07T20:31:48.5397140Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5397773Z def test_silu_mul_quant( 2025-05-07T20:31:48.5398141Z self, 2025-05-07T20:31:48.5398438Z T: int, 2025-05-07T20:31:48.5398717Z D: int, 2025-05-07T20:31:48.5399045Z scale_ub: Optional[float], 2025-05-07T20:31:48.5399465Z contiguous: bool, 2025-05-07T20:31:48.5399808Z compiled: bool, 2025-05-07T20:31:48.5400140Z ) -> None: 2025-05-07T20:31:48.5400459Z torch.manual_seed(2025) 2025-05-07T20:31:48.5400810Z 2025-05-07T20:31:48.5401208Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5401710Z 2025-05-07T20:31:48.5401990Z x_sign = torch.sign(x) 2025-05-07T20:31:48.5402418Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.5402871Z x = x_sign * x_clamp 2025-05-07T20:31:48.5403231Z x0 = x[:, :D] 2025-05-07T20:31:48.5403543Z x1 = x[:, D:] 2025-05-07T20:31:48.5403850Z 2025-05-07T20:31:48.5404127Z if contiguous: 2025-05-07T20:31:48.5404459Z x0 = x0.contiguous() 
2025-05-07T20:31:48.5404845Z x1 = x1.contiguous() 2025-05-07T20:31:48.5405202Z 2025-05-07T20:31:48.5405479Z if scale_ub is not None: 2025-05-07T20:31:48.5405883Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.5406530Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.5407012Z ) 2025-05-07T20:31:48.5407295Z else: 2025-05-07T20:31:48.5407600Z scale_ub_tensor = None 2025-05-07T20:31:48.5407971Z 2025-05-07T20:31:48.5408312Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5408767Z op = silu_mul_quant 2025-05-07T20:31:48.5409122Z if compiled: 2025-05-07T20:31:48.5409482Z op = torch.compile(op) 2025-05-07T20:31:48.5409920Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.5410328Z 2025-05-07T20:31:48.5410610Z y_fp8, y_scale = fn() 2025-05-07T20:31:48.5411028Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:48.5411444Z 2025-05-07T20:31:48.5411786Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5412271Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:48.5412706Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:48.5413157Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:48.5413680Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5414128Z 2025-05-07T20:31:48.5414415Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:48.5414710Z 2025-05-07T20:31:48.5414856Z moe/activation_test.py:126: 2025-05-07T20:31:48.5415292Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5415782Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:48.5416255Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5417494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:48.5418691Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:48.5419467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.5420454Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.5421456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:48.5422515Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5423593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:48.5424685Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5425735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:48.5426702Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:48.5427575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:48.5428342Z fn() 2025-05-07T20:31:48.5429090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:48.5429932Z self.fn.run( 2025-05-07T20:31:48.5430613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.5431363Z kernel = self.compile( 2025-05-07T20:31:48.5432111Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.5433006Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.5433545Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5433861Z 2025-05-07T20:31:48.5434162Z self = 2025-05-07T20:31:48.5435688Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.5437803Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc53d8f4af0>} 2025-05-07T20:31:48.5439615Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.5441031Z context = 2025-05-07T20:31:48.5441445Z 2025-05-07T20:31:48.5441700Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.5442405Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.5443006Z module_map=module_map) 2025-05-07T20:31:48.5443473Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.5443928Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:48.5444264Z E ^ 2025-05-07T20:31:48.5444873Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.5445491Z 2025-05-07T20:31:48.5446032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.5446697Z 2025-05-07T20:31:48.5446839Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5447357Z self=, 2025-05-07T20:31:48.5447894Z T=2048, 2025-05-07T20:31:48.5448253Z D=5120, 2025-05-07T20:31:48.5448497Z scale_ub=1200.0, 2025-05-07T20:31:48.5448807Z contiguous=True, 2025-05-07T20:31:48.5449099Z compiled=False, 2025-05-07T20:31:48.5449360Z ) 2025-05-07T20:31:48.5449765Z self = 2025-05-07T20:31:48.5450407Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:48.5450776Z 2025-05-07T20:31:48.5450892Z @given( 2025-05-07T20:31:48.5451188Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5451583Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5451972Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5452384Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5452804Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5453191Z ) 2025-05-07T20:31:48.5453690Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5454275Z def test_silu_mul_quant( 2025-05-07T20:31:48.5454617Z self, 2025-05-07T20:31:48.5454873Z T: int, 2025-05-07T20:31:48.5455132Z D: int, 2025-05-07T20:31:48.5455409Z scale_ub: Optional[float], 2025-05-07T20:31:48.5456078Z contiguous: bool, 2025-05-07T20:31:48.5456401Z compiled: bool, 2025-05-07T20:31:48.5456743Z ) -> None: 2025-05-07T20:31:48.5457015Z torch.manual_seed(2025) 2025-05-07T20:31:48.5457336Z 2025-05-07T20:31:48.5457692Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5458261Z 2025-05-07T20:31:48.5458537Z x_sign = torch.sign(x) 2025-05-07T20:31:48.5458953Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.5459393Z x = x_sign * x_clamp 2025-05-07T20:31:48.5459730Z x0 = x[:, :D] 
2025-05-07T20:31:48.5460031Z x1 = x[:, D:] 2025-05-07T20:31:48.5460319Z 2025-05-07T20:31:48.5460594Z if contiguous: 2025-05-07T20:31:48.5460925Z x0 = x0.contiguous() 2025-05-07T20:31:48.5461288Z x1 = x1.contiguous() 2025-05-07T20:31:48.5461843Z 2025-05-07T20:31:48.5462115Z if scale_ub is not None: 2025-05-07T20:31:48.5462489Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.5462950Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.5463400Z ) 2025-05-07T20:31:48.5463693Z else: 2025-05-07T20:31:48.5484497Z scale_ub_tensor = None 2025-05-07T20:31:48.5484890Z 2025-05-07T20:31:48.5485246Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5485700Z op = silu_mul_quant 2025-05-07T20:31:48.5486062Z if compiled: 2025-05-07T20:31:48.5486452Z op = torch.compile(op) 2025-05-07T20:31:48.5486901Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.5487299Z 2025-05-07T20:31:48.5487599Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.5487837Z 2025-05-07T20:31:48.5487988Z moe/activation_test.py:117: 2025-05-07T20:31:48.5488405Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5488888Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.5489299Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.5490278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.5491266Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.5492039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.5492996Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.5493937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.5494701Z kernel = self.compile( 2025-05-07T20:31:48.5495715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.5496666Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.5497220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5497555Z 2025-05-07T20:31:48.5497847Z self = 2025-05-07T20:31:48.5499487Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.5501500Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc53d946ef0>} 2025-05-07T20:31:48.5503418Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.5504882Z context = 2025-05-07T20:31:48.5505299Z 2025-05-07T20:31:48.5505540Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.5506343Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.5507013Z module_map=module_map) 2025-05-07T20:31:48.5507534Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.5508028Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.5508394Z E ^ 2025-05-07T20:31:48.5509064Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.5509718Z 2025-05-07T20:31:48.5510316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.5511038Z 2025-05-07T20:31:48.5511320Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5511905Z self=, 2025-05-07T20:31:48.5512469Z T=2048, 2025-05-07T20:31:48.5512741Z D=5120, 2025-05-07T20:31:48.5513014Z scale_ub=1200.0, 2025-05-07T20:31:48.5513338Z contiguous=True, 2025-05-07T20:31:48.5513648Z compiled=True, 2025-05-07T20:31:48.5513927Z ) 2025-05-07T20:31:48.5514383Z self = 2025-05-07T20:31:48.5515094Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:48.5515466Z 2025-05-07T20:31:48.5515580Z @given( 2025-05-07T20:31:48.5515887Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5516356Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5516819Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5517290Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5517777Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5518202Z ) 2025-05-07T20:31:48.5518711Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5519350Z def test_silu_mul_quant( 2025-05-07T20:31:48.5519698Z self, 2025-05-07T20:31:48.5519975Z T: int, 2025-05-07T20:31:48.5520258Z D: int, 2025-05-07T20:31:48.5520585Z scale_ub: Optional[float], 2025-05-07T20:31:48.5520974Z contiguous: bool, 2025-05-07T20:31:48.5521327Z compiled: bool, 2025-05-07T20:31:48.5521653Z ) -> None: 2025-05-07T20:31:48.5521962Z torch.manual_seed(2025) 2025-05-07T20:31:48.5522317Z 2025-05-07T20:31:48.5522716Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5523202Z 2025-05-07T20:31:48.5523594Z x_sign = torch.sign(x) 2025-05-07T20:31:48.5524019Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.5524461Z x = x_sign * x_clamp 2025-05-07T20:31:48.5524813Z x0 = x[:, :D] 2025-05-07T20:31:48.5525131Z x1 = x[:, D:] 2025-05-07T20:31:48.5525422Z 2025-05-07T20:31:48.5525680Z if contiguous: 2025-05-07T20:31:48.5526020Z x0 = x0.contiguous() 2025-05-07T20:31:48.5526380Z x1 = x1.contiguous() 2025-05-07T20:31:48.5526786Z 2025-05-07T20:31:48.5527071Z if scale_ub is not None: 2025-05-07T20:31:48.5527465Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.5527965Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.5528411Z ) 2025-05-07T20:31:48.5528686Z else: 2025-05-07T20:31:48.5528977Z scale_ub_tensor = None 2025-05-07T20:31:48.5529339Z 2025-05-07T20:31:48.5529674Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5530131Z op = silu_mul_quant 2025-05-07T20:31:48.5530482Z if compiled: 2025-05-07T20:31:48.5530814Z op = torch.compile(op) 2025-05-07T20:31:48.5531232Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.5531618Z 2025-05-07T20:31:48.5531881Z y_fp8, y_scale = fn() 2025-05-07T20:31:48.5532262Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:48.5532660Z 2025-05-07T20:31:48.5532988Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5533462Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:48.5533885Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:48.5534329Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:48.5534828Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5535265Z 2025-05-07T20:31:48.5535533Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:48.5535813Z 2025-05-07T20:31:48.5535989Z moe/activation_test.py:126: 2025-05-07T20:31:48.5536386Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5536954Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:48.5537384Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5538537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:48.5539584Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:48.5540354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.5541319Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.5542285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:48.5543307Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5544377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:48.5545473Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5546572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:48.5547489Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:48.5548154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:48.5548686Z fn() 2025-05-07T20:31:48.5549206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:48.5549802Z self.fn.run( 2025-05-07T20:31:48.5550399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.5550942Z kernel = self.compile( 2025-05-07T20:31:48.5551496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.5552171Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.5552576Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5552805Z 2025-05-07T20:31:48.5553018Z self = 2025-05-07T20:31:48.5554122Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.5556052Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fc53d9b4790>} 2025-05-07T20:31:48.5557442Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.5558483Z context = 2025-05-07T20:31:48.5558789Z 2025-05-07T20:31:48.5558962Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.5559501Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.5559984Z module_map=module_map) 2025-05-07T20:31:48.5560359Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.5560725Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:48.5561000Z E ^ 2025-05-07T20:31:48.5561473Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.5561935Z 2025-05-07T20:31:48.5562355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.5563752Z 2025-05-07T20:31:48.5563859Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5564282Z self=, 2025-05-07T20:31:48.5564684Z T=16384, 2025-05-07T20:31:48.5564880Z D=7168, 2025-05-07T20:31:48.5565078Z scale_ub=1200.0, 2025-05-07T20:31:48.5565305Z contiguous=False, 2025-05-07T20:31:48.5565543Z compiled=False, 2025-05-07T20:31:48.5565755Z ) 2025-05-07T20:31:48.5566127Z self = 2025-05-07T20:31:48.5566636Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:48.5566924Z 2025-05-07T20:31:48.5567003Z @given( 2025-05-07T20:31:48.5567247Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5567561Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5567873Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5568214Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5568545Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5568841Z ) 2025-05-07T20:31:48.5569203Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5569647Z def test_silu_mul_quant( 2025-05-07T20:31:48.5569894Z self, 2025-05-07T20:31:48.5570092Z T: int, 2025-05-07T20:31:48.5570296Z D: int, 2025-05-07T20:31:48.5570512Z scale_ub: Optional[float], 2025-05-07T20:31:48.5570788Z contiguous: bool, 2025-05-07T20:31:48.5571032Z compiled: bool, 2025-05-07T20:31:48.5571256Z ) -> None: 2025-05-07T20:31:48.5571477Z torch.manual_seed(2025) 2025-05-07T20:31:48.5571727Z 2025-05-07T20:31:48.5572131Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5572489Z 2025-05-07T20:31:48.5572695Z x_sign = torch.sign(x) 2025-05-07T20:31:48.5572996Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.5573315Z x = x_sign * x_clamp 2025-05-07T20:31:48.5573561Z x0 = x[:, :D] 2025-05-07T20:31:48.5573778Z x1 = x[:, D:] 2025-05-07T20:31:48.5573990Z 2025-05-07T20:31:48.5574184Z if contiguous: 2025-05-07T20:31:48.5574415Z x0 = x0.contiguous() 2025-05-07T20:31:48.5574683Z x1 = x1.contiguous() 2025-05-07T20:31:48.5574932Z 2025-05-07T20:31:48.5575127Z if scale_ub is not None: 2025-05-07T20:31:48.5575414Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.5575757Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.5576068Z ) 2025-05-07T20:31:48.5576263Z else: 2025-05-07T20:31:48.5576522Z scale_ub_tensor = None 2025-05-07T20:31:48.5576788Z 2025-05-07T20:31:48.5577025Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5577345Z op = silu_mul_quant 2025-05-07T20:31:48.5577608Z if compiled: 
2025-05-07T20:31:48.5577858Z op = torch.compile(op) 2025-05-07T20:31:48.5578345Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.5578626Z 2025-05-07T20:31:48.5578823Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.5578994Z 2025-05-07T20:31:48.5579112Z moe/activation_test.py:117: 2025-05-07T20:31:48.5579415Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5579749Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.5580035Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.5580738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.5581455Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.5582005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.5582824Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.5583504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.5584050Z kernel = self.compile( 2025-05-07T20:31:48.5584597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.5585269Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.5585674Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5585903Z 2025-05-07T20:31:48.5586127Z self = 2025-05-07T20:31:48.5587279Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.5588691Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc53d28d2d0>} 2025-05-07T20:31:48.5590063Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.5591104Z context = 2025-05-07T20:31:48.5591395Z 2025-05-07T20:31:48.5591569Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.5592091Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.5592649Z module_map=module_map) 2025-05-07T20:31:48.5593027Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.5593385Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.5593649Z E ^ 2025-05-07T20:31:48.5594122Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.5594580Z 2025-05-07T20:31:48.5595010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.5595528Z 2025-05-07T20:31:48.5595634Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5596059Z self=, 2025-05-07T20:31:48.5596465Z T=1, 2025-05-07T20:31:48.5596647Z D=7168, 2025-05-07T20:31:48.5596843Z scale_ub=None, 2025-05-07T20:31:48.5597062Z contiguous=True, 2025-05-07T20:31:48.5597283Z compiled=True, 2025-05-07T20:31:48.5597494Z ) 2025-05-07T20:31:48.5597826Z self = 2025-05-07T20:31:48.5598315Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:48.5598581Z 2025-05-07T20:31:48.5598657Z @given( 2025-05-07T20:31:48.5598889Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5599208Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5599510Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5599842Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5600176Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5600460Z ) 2025-05-07T20:31:48.5600821Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5601268Z def test_silu_mul_quant( 2025-05-07T20:31:48.5601512Z self, 2025-05-07T20:31:48.5601702Z T: int, 2025-05-07T20:31:48.5601901Z D: int, 2025-05-07T20:31:48.5602130Z scale_ub: Optional[float], 2025-05-07T20:31:48.5602401Z contiguous: bool, 2025-05-07T20:31:48.5602645Z compiled: bool, 2025-05-07T20:31:48.5602961Z ) -> None: 2025-05-07T20:31:48.5603174Z torch.manual_seed(2025) 2025-05-07T20:31:48.5603421Z 2025-05-07T20:31:48.5603700Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5604038Z 2025-05-07T20:31:48.5604238Z x_sign = torch.sign(x) 2025-05-07T20:31:48.5604537Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.5604842Z x = x_sign * x_clamp 2025-05-07T20:31:48.5605086Z x0 = x[:, :D] 2025-05-07T20:31:48.5605307Z x1 = x[:, D:] 2025-05-07T20:31:48.5605511Z 2025-05-07T20:31:48.5605703Z if contiguous: 2025-05-07T20:31:48.5605938Z x0 = x0.contiguous() 2025-05-07T20:31:48.5606206Z x1 = x1.contiguous() 2025-05-07T20:31:48.5606487Z 2025-05-07T20:31:48.5606690Z if scale_ub is not None: 2025-05-07T20:31:48.5606965Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.5607299Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.5607611Z ) 2025-05-07T20:31:48.5607806Z else: 2025-05-07T20:31:48.5608011Z scale_ub_tensor = None 2025-05-07T20:31:48.5608265Z 2025-05-07T20:31:48.5608500Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5608813Z op = silu_mul_quant 2025-05-07T20:31:48.5609064Z if compiled: 2025-05-07T20:31:48.5609315Z op = torch.compile(op) 2025-05-07T20:31:48.5609610Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.5609886Z 2025-05-07T20:31:48.5610084Z y_fp8, y_scale = fn() 2025-05-07T20:31:48.5610369Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:48.5610661Z 2025-05-07T20:31:48.5610903Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5611331Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:48.5611624Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:48.5611945Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:48.5612308Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5612613Z 2025-05-07T20:31:48.5612820Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:48.5613017Z 2025-05-07T20:31:48.5613123Z moe/activation_test.py:126: 2025-05-07T20:31:48.5613415Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5613747Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:48.5614083Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5614884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:48.5615644Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:48.5616204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.5616952Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.5617644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:48.5618484Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5619246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:48.5620004Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5620739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:48.5621397Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:48.5622009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:48.5622620Z fn() 2025-05-07T20:31:48.5623130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:48.5623721Z self.fn.run( 2025-05-07T20:31:48.5624197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.5624730Z kernel = self.compile( 2025-05-07T20:31:48.5625280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.5625944Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.5626342Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5626572Z 2025-05-07T20:31:48.5626787Z self = 2025-05-07T20:31:48.5627882Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.5629287Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fc53c0fd5a0>} 2025-05-07T20:31:48.5630648Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.5631691Z context = 2025-05-07T20:31:48.5631982Z 2025-05-07T20:31:48.5632153Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.5632838Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.5633321Z module_map=module_map) 2025-05-07T20:31:48.5633692Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.5634053Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:48.5634324Z E ^ 2025-05-07T20:31:48.5634796Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.5635251Z 2025-05-07T20:31:48.5635673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.5636196Z 2025-05-07T20:31:48.5636310Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5636769Z self=, 2025-05-07T20:31:48.5637166Z T=4096, 2025-05-07T20:31:48.5637360Z D=5120, 2025-05-07T20:31:48.5637555Z scale_ub=None, 2025-05-07T20:31:48.5637781Z contiguous=False, 2025-05-07T20:31:48.5638005Z compiled=False, 2025-05-07T20:31:48.5638213Z ) 2025-05-07T20:31:48.5638539Z self = 2025-05-07T20:31:48.5639040Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:48.5639320Z 2025-05-07T20:31:48.5639397Z @given( 2025-05-07T20:31:48.5639631Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5639941Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5640254Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5640590Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5640917Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5641204Z ) 2025-05-07T20:31:48.5641559Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5642009Z def test_silu_mul_quant( 2025-05-07T20:31:48.5642253Z self, 2025-05-07T20:31:48.5642453Z T: int, 2025-05-07T20:31:48.5642653Z D: int, 2025-05-07T20:31:48.5642871Z scale_ub: Optional[float], 2025-05-07T20:31:48.5643238Z contiguous: bool, 2025-05-07T20:31:48.5643486Z compiled: bool, 2025-05-07T20:31:48.5643707Z ) -> None: 2025-05-07T20:31:48.5643926Z torch.manual_seed(2025) 2025-05-07T20:31:48.5644173Z 2025-05-07T20:31:48.5644445Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5644791Z 2025-05-07T20:31:48.5644993Z x_sign = torch.sign(x) 2025-05-07T20:31:48.5645287Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.5645599Z x = x_sign * x_clamp 2025-05-07T20:31:48.5645841Z x0 = x[:, :D] 2025-05-07T20:31:48.5646055Z x1 = x[:, D:] 2025-05-07T20:31:48.5646267Z 2025-05-07T20:31:48.5646464Z if contiguous: 2025-05-07T20:31:48.5646732Z x0 = x0.contiguous() 2025-05-07T20:31:48.5646995Z x1 = x1.contiguous() 2025-05-07T20:31:48.5647236Z 2025-05-07T20:31:48.5647431Z if scale_ub is not None: 2025-05-07T20:31:48.5647707Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.5648045Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.5648356Z ) 2025-05-07T20:31:48.5648543Z else: 2025-05-07T20:31:48.5648754Z scale_ub_tensor = None 2025-05-07T20:31:48.5649009Z 2025-05-07T20:31:48.5649241Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5649561Z op = silu_mul_quant 2025-05-07T20:31:48.5649816Z if compiled: 
2025-05-07T20:31:48.5650061Z                 op = torch.compile(op)
2025-05-07T20:31:48.5650362Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:48.5650638Z 
2025-05-07T20:31:48.5650828Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:48.5650998Z 
2025-05-07T20:31:48.5651097Z moe/activation_test.py:117: 
2025-05-07T20:31:48.5651503Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:48.5651836Z moe/activation_test.py:115: in fn
2025-05-07T20:31:48.5652121Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:48.5652822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:48.5653522Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:48.5654063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:48.5654755Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:48.5655427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:48.5656292Z     kernel = self.compile(
2025-05-07T20:31:48.5665393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:48.5666114Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:48.5666567Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:48.5672639Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:48.5673177Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:48.5673656Z                            module_map=module_map)
2025-05-07T20:31:48.5674025Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:48.5674388Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:31:48.5674659Z E   ^
2025-05-07T20:31:48.5675133Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:48.5675598Z 
2025-05-07T20:31:48.5676029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
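The root cause is the same in every example: this job runs on a g5.4xlarge runner, whose NVIDIA A10G reports compute capability (8, 6), while Triton's fp8e4nv type (PyTorch's float8_e4m3fn) requires native fp8 hardware support, which arrives with compute capability 8.9 (Ada) and 9.0 (Hopper); on sm_86 Triton only offers ('fp8e4b15', 'fp8e5'), exactly as the ValueError reports. A guard along the following lines, a minimal sketch assuming PyTorch's torch.cuda.get_device_capability() API (the helper and decorator names are illustrative, not FBGEMM's actual gating), would let such tests skip cleanly instead of erroring on unsupported hardware:

```python
import unittest

import torch


def _cuda_supports_fp8e4nv() -> bool:
    """Illustrative check: fp8e4nv (float8_e4m3fn) needs compute capability >= 8.9."""
    if not torch.cuda.is_available():
        return False
    # An A10G (g5.4xlarge) reports (8, 6) and fails this check.
    return torch.cuda.get_device_capability() >= (8, 9)


# Applied to the test class or method, this turns the hard CompilationError
# into a clean skip on pre-Ada GPUs.
requires_fp8e4nv = unittest.skipIf(
    not _cuda_supports_fp8e4nv(),
    "fp8e4nv requires an fp8-capable GPU (compute capability >= 8.9)",
)
```

A capability check of this shape mirrors the condition Triton itself enforces at compile time, so it fails in the same cases without ever invoking the compiler.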
Hypothesis then works through the remaining parameter combinations, and every one hits the identical fp8e4nv CompilationError. With compiled=False the error is raised directly from fn() inside the kernel under test, _fbgemm_silu_mul_quant; with compiled=True, fn() completes and the same error is instead raised from ref_fn(), whose triton_quantize_fp8_row reference launches _kernel_quantize_fp8_row through the Triton autotuner (fp8_gemm.py:2370 -> autotuner.py:186 -> compiler.py:273). The per-example test listings and tracebacks are identical except for the parameters, and condense to:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
    -> CompilationError in _fbgemm_silu_mul_quant (fn(), moe/activation_test.py:117)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
    -> CompilationError in _kernel_quantize_fp8_row (ref_fn(), moe/activation_test.py:126)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
    -> CompilationError in _fbgemm_silu_mul_quant (fn(), moe/activation_test.py:117)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    -> CompilationError in _fbgemm_silu_mul_quant (fn(), moe/activation_test.py:117)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> CompilationError in _kernel_quantize_fp8_row (ref_fn(), moe/activation_test.py:126)

The remaining examples, all with compiled=True, fail on this same ref_fn() path; what that reference computes is sketched below.
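For context on the failing reference path: ref_fn() computes the SiLU-gated product in fp32 (y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32, moe/activation_test.py:124) and then quantizes it row-wise to fp8 via triton_quantize_fp8_row, under the contract that y is recovered as y_fp8.to(torch.float32) * y_scale[:, None], which is exactly how the test dequantizes. A rough eager-mode sketch of that contract, assuming torch.float8_e4m3fn as the target type and a simple max-abs row scale with scale_ub capping the row max (the epsilon and clamping details are illustrative and may differ from FBGEMM's actual kernel):

```python
from typing import Optional, Tuple

import torch

FP8_DTYPE = torch.float8_e4m3fn        # assumed target type; finfo(...).max == 448.0
FP8_MAX = torch.finfo(FP8_DTYPE).max


def silu_mul_quant_eager(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in fp32, mirroring ref_fn() in the test.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    # One scale per row, from the row's max magnitude (optionally capped).
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max / FP8_MAX
    # Contract checked by the test: y ~= y_fp8.to(torch.float32) * y_scale[:, None]
    y_fp8 = (y / y_scale[:, None]).to(FP8_DTYPE)
    return y_fp8, y_scale
```

Note that even this eager sketch would fail on the runner at the final cast on some stacks, since the failure here is about hardware fp8 support, not about the Triton kernel specifically.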
at 0x7fc517344820>} 2025-05-07T20:31:48.5829715Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.5829920Z context = 2025-05-07T20:31:48.5829924Z 2025-05-07T20:31:48.5830089Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.5830360Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.5830475Z module_map=module_map) 2025-05-07T20:31:48.5830641Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.5830748Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:48.5830829Z E ^ 2025-05-07T20:31:48.5831196Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.5831201Z 2025-05-07T20:31:48.5831629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.5831720Z 2025-05-07T20:31:48.5831825Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5832056Z self=, 2025-05-07T20:31:48.5832137Z T=2048, 2025-05-07T20:31:48.5832213Z D=5120, 2025-05-07T20:31:48.5832295Z scale_ub=None, 2025-05-07T20:31:48.5832387Z contiguous=True, 2025-05-07T20:31:48.5832468Z compiled=True, 2025-05-07T20:31:48.5832544Z ) 2025-05-07T20:31:48.5832768Z self = 2025-05-07T20:31:48.5832943Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:48.5832947Z 2025-05-07T20:31:48.5833026Z @given( 2025-05-07T20:31:48.5833152Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5833252Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5833373Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5833499Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5833617Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5833693Z ) 2025-05-07T20:31:48.5833944Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5834044Z def test_silu_mul_quant( 2025-05-07T20:31:48.5834119Z self, 2025-05-07T20:31:48.5834193Z T: int, 2025-05-07T20:31:48.5834270Z D: int, 2025-05-07T20:31:48.5834371Z scale_ub: Optional[float], 2025-05-07T20:31:48.5834462Z contiguous: bool, 2025-05-07T20:31:48.5834552Z compiled: bool, 2025-05-07T20:31:48.5834629Z ) -> None: 2025-05-07T20:31:48.5834726Z torch.manual_seed(2025) 2025-05-07T20:31:48.5834803Z 2025-05-07T20:31:48.5835057Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5835131Z 2025-05-07T20:31:48.5835228Z x_sign = torch.sign(x) 2025-05-07T20:31:48.5835359Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.5835454Z x = x_sign * x_clamp 2025-05-07T20:31:48.5835534Z x0 = x[:, :D] 2025-05-07T20:31:48.5835615Z x1 = x[:, D:] 2025-05-07T20:31:48.5835693Z 2025-05-07T20:31:48.5835778Z if contiguous: 2025-05-07T20:31:48.5835871Z x0 = x0.contiguous() 2025-05-07T20:31:48.5835964Z x1 = x1.contiguous() 2025-05-07T20:31:48.5836037Z 2025-05-07T20:31:48.5836128Z if scale_ub is not None: 2025-05-07T20:31:48.5836244Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.5836379Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.5836454Z ) 2025-05-07T20:31:48.5836531Z else: 2025-05-07T20:31:48.5836630Z scale_ub_tensor = None 2025-05-07T20:31:48.5836701Z 2025-05-07T20:31:48.5836839Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5836934Z op = silu_mul_quant 2025-05-07T20:31:48.5837022Z if compiled: 
2025-05-07T20:31:48.5837121Z op = torch.compile(op) 2025-05-07T20:31:48.5837227Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.5837302Z 2025-05-07T20:31:48.5837394Z y_fp8, y_scale = fn() 2025-05-07T20:31:48.5837517Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:48.5837593Z 2025-05-07T20:31:48.5837731Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5837837Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:48.5837943Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:48.5838066Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:48.5838212Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5838289Z 2025-05-07T20:31:48.5838388Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:48.5838393Z 2025-05-07T20:31:48.5838491Z moe/activation_test.py:126: 2025-05-07T20:31:48.5838712Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5838818Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:48.5838957Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5839532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:48.5839633Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:48.5840007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.5840234Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.5840616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:48.5840875Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5841286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:48.5841544Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5841925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:48.5842099Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:48.5842448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:48.5842524Z fn() 2025-05-07T20:31:48.5843037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:48.5843118Z self.fn.run( 2025-05-07T20:31:48.5843462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.5843564Z kernel = self.compile( 2025-05-07T20:31:48.5843953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.5844131Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.5844256Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5844260Z 2025-05-07T20:31:48.5844469Z self = 2025-05-07T20:31:48.5845268Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:31:48.5845785Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc516e52a70>} 2025-05-07T20:31:48.5846582Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.5846801Z context = 2025-05-07T20:31:48.5846806Z 2025-05-07T20:31:48.5846973Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.5847245Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.5847351Z module_map=module_map) 2025-05-07T20:31:48.5847520Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.5847623Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:48.5847705Z E ^ 2025-05-07T20:31:48.5848072Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.5848184Z 2025-05-07T20:31:48.5848608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.5848613Z 2025-05-07T20:31:48.5848723Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5848950Z self=, 2025-05-07T20:31:48.5849026Z T=128, 2025-05-07T20:31:48.5849109Z D=5120, 2025-05-07T20:31:48.5849191Z scale_ub=None, 2025-05-07T20:31:48.5849273Z contiguous=True, 2025-05-07T20:31:48.5849360Z compiled=True, 2025-05-07T20:31:48.5849435Z ) 2025-05-07T20:31:48.5849657Z self = 2025-05-07T20:31:48.5849837Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:48.5849842Z 2025-05-07T20:31:48.5849914Z @given( 2025-05-07T20:31:48.5850040Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5850144Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5850260Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5850381Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5850497Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5850569Z ) 2025-05-07T20:31:48.5850823Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5850920Z def test_silu_mul_quant( 2025-05-07T20:31:48.5850996Z self, 2025-05-07T20:31:48.5851074Z T: int, 2025-05-07T20:31:48.5851149Z D: int, 2025-05-07T20:31:48.5851247Z scale_ub: Optional[float], 2025-05-07T20:31:48.5851340Z contiguous: bool, 2025-05-07T20:31:48.5851426Z compiled: bool, 2025-05-07T20:31:48.5851591Z ) -> None: 2025-05-07T20:31:48.5851689Z torch.manual_seed(2025) 2025-05-07T20:31:48.5851760Z 2025-05-07T20:31:48.5851936Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5852012Z 2025-05-07T20:31:48.5852104Z x_sign = torch.sign(x) 2025-05-07T20:31:48.5852235Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.5852325Z x = x_sign * x_clamp 2025-05-07T20:31:48.5852405Z x0 = x[:, :D] 2025-05-07T20:31:48.5852486Z x1 = x[:, D:] 2025-05-07T20:31:48.5852559Z 2025-05-07T20:31:48.5852641Z if contiguous: 2025-05-07T20:31:48.5852736Z x0 = x0.contiguous() 2025-05-07T20:31:48.5852825Z x1 = x1.contiguous() 2025-05-07T20:31:48.5852896Z 2025-05-07T20:31:48.5852990Z if scale_ub is not None: 2025-05-07T20:31:48.5853097Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.5853239Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.5853319Z ) 2025-05-07T20:31:48.5853393Z else: 2025-05-07T20:31:48.5853490Z scale_ub_tensor = None 2025-05-07T20:31:48.5853565Z 2025-05-07T20:31:48.5853695Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:31:48.5853790Z op = silu_mul_quant 2025-05-07T20:31:48.5853874Z if compiled: 2025-05-07T20:31:48.5853973Z op = torch.compile(op) 2025-05-07T20:31:48.5854087Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.5854158Z 2025-05-07T20:31:48.5854249Z y_fp8, y_scale = fn() 2025-05-07T20:31:48.5854376Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:48.5854447Z 2025-05-07T20:31:48.5854590Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5854691Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:48.5854790Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:48.5854922Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:48.5855062Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5855133Z 2025-05-07T20:31:48.5855323Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:48.5855327Z 2025-05-07T20:31:48.5855424Z moe/activation_test.py:126: 2025-05-07T20:31:48.5855774Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5855937Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:48.5856112Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5856698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:48.5856811Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:48.5857198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.5857431Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.5857803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:48.5858124Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5858527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:48.5858779Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5859159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:48.5859326Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:48.5859677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:48.5859755Z fn() 2025-05-07T20:31:48.5860300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:48.5860391Z self.fn.run( 2025-05-07T20:31:48.5860739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.5860833Z kernel = self.compile( 2025-05-07T20:31:48.5861222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.5861399Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.5861527Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5861532Z 2025-05-07T20:31:48.5861739Z self = 2025-05-07T20:31:48.5862530Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.5863044Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc514dd11b0>} 2025-05-07T20:31:48.5863803Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.5863998Z context = 2025-05-07T20:31:48.5864003Z 2025-05-07T20:31:48.5864169Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.5864435Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.5864550Z module_map=module_map) 2025-05-07T20:31:48.5864720Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.5864827Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:48.5864903Z E ^ 2025-05-07T20:31:48.5865380Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.5865385Z 2025-05-07T20:31:48.5865808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.5865812Z 2025-05-07T20:31:48.5865916Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5866144Z self=, 2025-05-07T20:31:48.5866222Z T=4096, 2025-05-07T20:31:48.5866299Z D=5120, 2025-05-07T20:31:48.5866388Z scale_ub=None, 2025-05-07T20:31:48.5866475Z contiguous=True, 2025-05-07T20:31:48.5866558Z compiled=True, 2025-05-07T20:31:48.5866637Z ) 2025-05-07T20:31:48.5866862Z self = 2025-05-07T20:31:48.5867033Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:48.5867038Z 2025-05-07T20:31:48.5867122Z @given( 2025-05-07T20:31:48.5867241Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5867345Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5867462Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5867581Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5867701Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5867775Z ) 2025-05-07T20:31:48.5868024Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5868123Z def test_silu_mul_quant( 2025-05-07T20:31:48.5868198Z self, 2025-05-07T20:31:48.5868274Z T: int, 2025-05-07T20:31:48.5868357Z D: int, 2025-05-07T20:31:48.5868456Z scale_ub: Optional[float], 2025-05-07T20:31:48.5868630Z contiguous: bool, 2025-05-07T20:31:48.5868721Z compiled: bool, 2025-05-07T20:31:48.5868801Z ) -> None: 2025-05-07T20:31:48.5868901Z torch.manual_seed(2025) 2025-05-07T20:31:48.5868982Z 2025-05-07T20:31:48.5869151Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5869228Z 2025-05-07T20:31:48.5869323Z x_sign = torch.sign(x) 2025-05-07T20:31:48.5869449Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.5869546Z x = x_sign * x_clamp 2025-05-07T20:31:48.5869625Z x0 = x[:, :D] 2025-05-07T20:31:48.5869705Z x1 = x[:, D:] 2025-05-07T20:31:48.5869779Z 2025-05-07T20:31:48.5869862Z if contiguous: 2025-05-07T20:31:48.5869955Z x0 = x0.contiguous() 2025-05-07T20:31:48.5870048Z x1 = x1.contiguous() 2025-05-07T20:31:48.5870125Z 2025-05-07T20:31:48.5870214Z if scale_ub is not None: 2025-05-07T20:31:48.5870330Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.5870468Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.5870546Z ) 2025-05-07T20:31:48.5870626Z else: 2025-05-07T20:31:48.5870720Z scale_ub_tensor 
= None 2025-05-07T20:31:48.5870796Z 2025-05-07T20:31:48.5870926Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5871017Z op = silu_mul_quant 2025-05-07T20:31:48.5871104Z if compiled: 2025-05-07T20:31:48.5871204Z op = torch.compile(op) 2025-05-07T20:31:48.5871311Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.5871387Z 2025-05-07T20:31:48.5871478Z y_fp8, y_scale = fn() 2025-05-07T20:31:48.5871600Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:48.5871677Z 2025-05-07T20:31:48.5871814Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5871918Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:48.5872021Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:48.5872144Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:48.5872373Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5872449Z 2025-05-07T20:31:48.5872548Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:48.5872553Z 2025-05-07T20:31:48.5872652Z moe/activation_test.py:126: 2025-05-07T20:31:48.5872779Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5872890Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:48.5873026Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5873588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:48.5873696Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:48.5874064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.5874288Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.5874668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:48.5874923Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5875332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:48.5875586Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5875965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:48.5876138Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:48.5876691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:48.5876775Z fn() 2025-05-07T20:31:48.5877178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:48.5877268Z self.fn.run( 2025-05-07T20:31:48.5877611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.5877706Z kernel = self.compile( 2025-05-07T20:31:48.5878089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.5878267Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.5878393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5878398Z 2025-05-07T20:31:48.5878607Z self = 2025-05-07T20:31:48.5879397Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.5879911Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc516e51d80>} 2025-05-07T20:31:48.5880672Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.5880868Z context = 2025-05-07T20:31:48.5880872Z 2025-05-07T20:31:48.5881042Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.5881308Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.5881423Z module_map=module_map) 2025-05-07T20:31:48.5881591Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.5881775Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:48.5881854Z E ^ 2025-05-07T20:31:48.5882213Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.5882218Z 2025-05-07T20:31:48.5882638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.5882643Z 2025-05-07T20:31:48.5882754Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5882977Z self=, 2025-05-07T20:31:48.5883060Z T=16384, 2025-05-07T20:31:48.5883136Z D=5120, 2025-05-07T20:31:48.5883217Z scale_ub=None, 2025-05-07T20:31:48.5883306Z contiguous=True, 2025-05-07T20:31:48.5883393Z compiled=True, 2025-05-07T20:31:48.5883468Z ) 2025-05-07T20:31:48.5883692Z self = 2025-05-07T20:31:48.5883871Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:48.5883876Z 2025-05-07T20:31:48.5883952Z @given( 2025-05-07T20:31:48.5884076Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5884177Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5884296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5884414Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5884528Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5884606Z ) 2025-05-07T20:31:48.5884854Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5884950Z def test_silu_mul_quant( 2025-05-07T20:31:48.5885033Z self, 2025-05-07T20:31:48.5885110Z T: int, 2025-05-07T20:31:48.5885268Z D: int, 2025-05-07T20:31:48.5885375Z scale_ub: Optional[float], 2025-05-07T20:31:48.5885465Z contiguous: bool, 2025-05-07T20:31:48.5885563Z compiled: bool, 2025-05-07T20:31:48.5885642Z ) -> None: 2025-05-07T20:31:48.5885738Z torch.manual_seed(2025) 2025-05-07T20:31:48.5885816Z 2025-05-07T20:31:48.5885986Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5886061Z 2025-05-07T20:31:48.5886162Z x_sign = torch.sign(x) 2025-05-07T20:31:48.5886286Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.5886380Z x = x_sign * x_clamp 2025-05-07T20:31:48.5886477Z x0 = x[:, :D] 2025-05-07T20:31:48.5886567Z x1 = x[:, D:] 2025-05-07T20:31:48.5886656Z 2025-05-07T20:31:48.5886753Z if contiguous: 2025-05-07T20:31:48.5886845Z x0 = x0.contiguous() 2025-05-07T20:31:48.5886936Z x1 = x1.contiguous() 2025-05-07T20:31:48.5887009Z 2025-05-07T20:31:48.5887108Z if scale_ub is not None: 2025-05-07T20:31:48.5887217Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.5887354Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:31:48.5887432Z ) 2025-05-07T20:31:48.5887515Z else: 2025-05-07T20:31:48.5887609Z scale_ub_tensor = None 2025-05-07T20:31:48.5887682Z 2025-05-07T20:31:48.5887818Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5887909Z op = silu_mul_quant 2025-05-07T20:31:48.5887999Z if compiled: 2025-05-07T20:31:48.5888101Z op = torch.compile(op) 2025-05-07T20:31:48.5888210Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.5888290Z 2025-05-07T20:31:48.5888382Z y_fp8, y_scale = fn() 2025-05-07T20:31:48.5888504Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:48.5888579Z 2025-05-07T20:31:48.5888719Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5888820Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:48.5888923Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:48.5889127Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:48.5889272Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5889345Z 2025-05-07T20:31:48.5889445Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:48.5889450Z 2025-05-07T20:31:48.5889551Z moe/activation_test.py:126: 2025-05-07T20:31:48.5889677Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5889783Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:48.5889921Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5890485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:48.5890598Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:48.5890962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.5891190Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.5891565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:48.5891820Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5892221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:48.5892475Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5892852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:48.5893103Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:48.5893450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:48.5893532Z fn() 2025-05-07T20:31:48.5893940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:48.5894022Z self.fn.run( 2025-05-07T20:31:48.5894367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.5894462Z kernel = self.compile( 2025-05-07T20:31:48.5894845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.5895026Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.5895149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    [test body identical to the listing above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fc50fdb36d0>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
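The reference path makes the contract explicit: the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None], so triton_quantize_fp8_row must return a row-quantized tensor plus a per-row dequantization scale. A minimal pure-PyTorch sketch of that contract, assuming the e4m3 finite maximum of 448 and assuming scale_ub caps the per-row max (a sketch of the semantics, not fbgemm's kernel):

from typing import Optional, Tuple

import torch

FP8_E4M3_MAX = 448.0  # largest finite torch.float8_e4m3fn value

def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row absolute max in fp32; scale_ub (assumed semantics) caps outliers.
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    row_max = torch.clamp(row_max, min=1e-12)  # guard against all-zero rows
    y_scale = row_max / FP8_E4M3_MAX           # dequantization scale, shape [T]
    y_fp8 = (y.to(torch.float32) / y_scale[:, None]).to(torch.float8_e4m3fn)
    # Round-trip, as checked by the test: y_fp8.to(torch.float32) * y_scale[:, None] ≈ y
    return y_fp8, y_scale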
Hypothesis retries ten more examples, all failing identically: each reprints the same test body and ends in the same CompilationError at compiler.py:100, differing only in the drawn parameters and in which kernel first reaches the Triton compiler:

Trying example: T=1,   D=5120, scale_ub=None,   contiguous=False, compiled=True   -> _kernel_quantize_fp8_row (via ref_fn, moe/activation_test.py:126)
Trying example: T=1,   D=5120, scale_ub=None,   contiguous=True,  compiled=False  -> _fbgemm_silu_mul_quant (via fn, moe/activation_test.py:117)
Trying example: T=128, D=5120, scale_ub=None,   contiguous=False, compiled=True   -> _fbgemm_silu_mul_quant (via fn)
Trying example: T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False  -> _fbgemm_silu_mul_quant (via fn)
Trying example: T=128, D=5120, scale_ub=None,   contiguous=False, compiled=False  -> _fbgemm_silu_mul_quant (via fn)
Trying example: T=128, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False  -> _fbgemm_silu_mul_quant (via fn)
Trying example: T=1,   D=7168, scale_ub=1200.0, contiguous=True,  compiled=True   -> _fbgemm_silu_mul_quant (via fn)
Trying example: T=1,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True   -> _fbgemm_silu_mul_quant (via fn)
Trying example: T=1,   D=7168, scale_ub=None,   contiguous=False, compiled=True   -> _kernel_quantize_fp8_row (via ref_fn)
Trying example: T=1,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True   -> _fbgemm_silu_mul_quant (via fn; the log cuts off mid-error here)

Every one of them terminates in:

E       triton.compiler.errors.CompilationError: at 1:0:
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6053345Z 2025-05-07T20:31:48.6053766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6053770Z 2025-05-07T20:31:48.6053872Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6054104Z self=, 2025-05-07T20:31:48.6054180Z T=1, 2025-05-07T20:31:48.6054261Z D=5120, 2025-05-07T20:31:48.6054342Z scale_ub=1200.0, 2025-05-07T20:31:48.6054426Z contiguous=False, 2025-05-07T20:31:48.6054510Z compiled=False, 2025-05-07T20:31:48.6054580Z ) 2025-05-07T20:31:48.6054800Z self = 2025-05-07T20:31:48.6054972Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:48.6054976Z 2025-05-07T20:31:48.6055049Z @given( 2025-05-07T20:31:48.6055166Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6055265Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6055382Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6055500Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6055859Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6055969Z ) 2025-05-07T20:31:48.6056244Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6056561Z def test_silu_mul_quant( 2025-05-07T20:31:48.6056647Z self, 2025-05-07T20:31:48.6056742Z T: int, 2025-05-07T20:31:48.6056815Z D: int, 2025-05-07T20:31:48.6056911Z scale_ub: Optional[float], 2025-05-07T20:31:48.6057003Z contiguous: bool, 2025-05-07T20:31:48.6057086Z compiled: bool, 2025-05-07T20:31:48.6057165Z ) -> None: 2025-05-07T20:31:48.6057264Z torch.manual_seed(2025) 2025-05-07T20:31:48.6057335Z 2025-05-07T20:31:48.6057509Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6057581Z 2025-05-07T20:31:48.6057676Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6057804Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6057891Z x = x_sign * x_clamp 2025-05-07T20:31:48.6057972Z x0 = x[:, :D] 2025-05-07T20:31:48.6058103Z x1 = x[:, D:] 2025-05-07T20:31:48.6058176Z 2025-05-07T20:31:48.6058258Z if contiguous: 2025-05-07T20:31:48.6058358Z x0 = x0.contiguous() 2025-05-07T20:31:48.6058444Z x1 = x1.contiguous() 2025-05-07T20:31:48.6058516Z 2025-05-07T20:31:48.6058608Z if scale_ub is not None: 2025-05-07T20:31:48.6058712Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6058847Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6058923Z ) 2025-05-07T20:31:48.6058997Z else: 2025-05-07T20:31:48.6059094Z scale_ub_tensor = None 2025-05-07T20:31:48.6059166Z 2025-05-07T20:31:48.6059296Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6059387Z op = silu_mul_quant 2025-05-07T20:31:48.6059469Z if compiled: 2025-05-07T20:31:48.6059566Z op = torch.compile(op) 2025-05-07T20:31:48.6059792Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6059866Z 2025-05-07T20:31:48.6059955Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6059965Z 2025-05-07T20:31:48.6060068Z moe/activation_test.py:117: 2025-05-07T20:31:48.6060194Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6060300Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6060399Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6060910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.6061009Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6061374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6061597Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6061948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6062040Z kernel = self.compile( 2025-05-07T20:31:48.6062436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6062611Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6062734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6062739Z 2025-05-07T20:31:48.6062948Z self = 2025-05-07T20:31:48.6063740Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6064262Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50feca050>} 2025-05-07T20:31:48.6065025Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6065326Z context = 2025-05-07T20:31:48.6065334Z 2025-05-07T20:31:48.6065499Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6065767Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6065875Z module_map=module_map) 2025-05-07T20:31:48.6066035Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6066133Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6066210Z E ^ 2025-05-07T20:31:48.6066573Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6066577Z 2025-05-07T20:31:48.6067051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6067061Z 2025-05-07T20:31:48.6067162Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6067386Z self=, 2025-05-07T20:31:48.6067467Z T=16384, 2025-05-07T20:31:48.6067542Z D=5120, 2025-05-07T20:31:48.6067623Z scale_ub=1200.0, 2025-05-07T20:31:48.6067710Z contiguous=False, 2025-05-07T20:31:48.6067790Z compiled=True, 2025-05-07T20:31:48.6067860Z ) 2025-05-07T20:31:48.6068083Z self = 2025-05-07T20:31:48.6068261Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:48.6068265Z 2025-05-07T20:31:48.6068343Z @given( 2025-05-07T20:31:48.6068540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6068639Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6068756Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6068878Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6068993Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6069066Z ) 2025-05-07T20:31:48.6069314Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6069409Z def test_silu_mul_quant( 2025-05-07T20:31:48.6069482Z self, 2025-05-07T20:31:48.6069556Z T: int, 2025-05-07T20:31:48.6069634Z D: int, 2025-05-07T20:31:48.6069730Z scale_ub: Optional[float], 2025-05-07T20:31:48.6069817Z contiguous: bool, 2025-05-07T20:31:48.6069902Z compiled: bool, 2025-05-07T20:31:48.6069983Z ) -> None: 2025-05-07T20:31:48.6070078Z torch.manual_seed(2025) 2025-05-07T20:31:48.6070152Z 2025-05-07T20:31:48.6070326Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6070402Z 2025-05-07T20:31:48.6070494Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6070622Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6070709Z x = x_sign * x_clamp 2025-05-07T20:31:48.6070789Z x0 = x[:, :D] 2025-05-07T20:31:48.6070866Z x1 = x[:, D:] 2025-05-07T20:31:48.6070943Z 2025-05-07T20:31:48.6071026Z if contiguous: 2025-05-07T20:31:48.6071114Z x0 = x0.contiguous() 2025-05-07T20:31:48.6071204Z x1 = x1.contiguous() 2025-05-07T20:31:48.6075453Z 2025-05-07T20:31:48.6075560Z if scale_ub is not None: 2025-05-07T20:31:48.6075677Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6075840Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6075914Z ) 2025-05-07T20:31:48.6076006Z else: 2025-05-07T20:31:48.6076111Z scale_ub_tensor = None 2025-05-07T20:31:48.6076179Z 2025-05-07T20:31:48.6076314Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6076507Z op = silu_mul_quant 2025-05-07T20:31:48.6076589Z if compiled: 2025-05-07T20:31:48.6076691Z op = torch.compile(op) 2025-05-07T20:31:48.6076796Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6076870Z 2025-05-07T20:31:48.6076959Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6076964Z 2025-05-07T20:31:48.6077062Z moe/activation_test.py:117: 2025-05-07T20:31:48.6077196Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6077299Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6077399Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6077782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:48.6077877Z return fn(*args, **kwargs) 
2025-05-07T20:31:48.6078385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.6078491Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6078857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6079086Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6079434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6079526Z kernel = self.compile( 2025-05-07T20:31:48.6079918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6080094Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6080300Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6080305Z 2025-05-07T20:31:48.6080519Z self = 2025-05-07T20:31:48.6081323Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6081840Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50fec83a0>} 2025-05-07T20:31:48.6082605Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6082802Z context = 2025-05-07T20:31:48.6082807Z 2025-05-07T20:31:48.6082978Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6083247Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6083362Z module_map=module_map) 2025-05-07T20:31:48.6083527Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6083629Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6083704Z E ^ 2025-05-07T20:31:48.6084065Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6084070Z 2025-05-07T20:31:48.6084497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6084502Z 2025-05-07T20:31:48.6084605Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6084833Z self=, 2025-05-07T20:31:48.6084912Z T=2048, 2025-05-07T20:31:48.6084985Z D=7168, 2025-05-07T20:31:48.6085070Z scale_ub=1200.0, 2025-05-07T20:31:48.6085155Z contiguous=False, 2025-05-07T20:31:48.6085319Z compiled=True, 2025-05-07T20:31:48.6085395Z ) 2025-05-07T20:31:48.6085618Z self = 2025-05-07T20:31:48.6085794Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:48.6085799Z 2025-05-07T20:31:48.6085875Z @given( 2025-05-07T20:31:48.6085994Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6086094Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6086209Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6086326Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6086461Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6086537Z ) 2025-05-07T20:31:48.6086816Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6086915Z def test_silu_mul_quant( 2025-05-07T20:31:48.6086987Z self, 2025-05-07T20:31:48.6087065Z T: int, 2025-05-07T20:31:48.6087144Z D: int, 2025-05-07T20:31:48.6087242Z scale_ub: Optional[float], 2025-05-07T20:31:48.6087330Z contiguous: bool, 2025-05-07T20:31:48.6087418Z compiled: bool, 2025-05-07T20:31:48.6087495Z ) -> None: 2025-05-07T20:31:48.6087594Z torch.manual_seed(2025) 2025-05-07T20:31:48.6087664Z 2025-05-07T20:31:48.6087836Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6087911Z 2025-05-07T20:31:48.6088004Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6088131Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6088225Z x = x_sign * x_clamp 2025-05-07T20:31:48.6088302Z x0 = x[:, :D] 2025-05-07T20:31:48.6088379Z x1 = x[:, D:] 2025-05-07T20:31:48.6088452Z 2025-05-07T20:31:48.6088614Z if contiguous: 2025-05-07T20:31:48.6088706Z x0 = x0.contiguous() 2025-05-07T20:31:48.6088797Z x1 = x1.contiguous() 2025-05-07T20:31:48.6088872Z 2025-05-07T20:31:48.6088969Z if scale_ub is not None: 2025-05-07T20:31:48.6089076Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6089212Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6089288Z ) 2025-05-07T20:31:48.6089362Z else: 2025-05-07T20:31:48.6089455Z scale_ub_tensor = None 2025-05-07T20:31:48.6089527Z 2025-05-07T20:31:48.6089659Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6089748Z op = silu_mul_quant 2025-05-07T20:31:48.6089833Z if compiled: 2025-05-07T20:31:48.6089936Z op = torch.compile(op) 2025-05-07T20:31:48.6090041Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6090113Z 2025-05-07T20:31:48.6090207Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6090211Z 2025-05-07T20:31:48.6090311Z moe/activation_test.py:117: 2025-05-07T20:31:48.6090438Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6090542Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6090647Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6091024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:48.6091116Z return fn(*args, **kwargs) 
2025-05-07T20:31:48.6091624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.6091721Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6092088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6092317Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6092663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6092842Z kernel = self.compile( 2025-05-07T20:31:48.6093229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6093408Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6093536Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6093541Z 2025-05-07T20:31:48.6093749Z self = 2025-05-07T20:31:48.6094546Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6095063Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50feca200>} 2025-05-07T20:31:48.6095834Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6096031Z context = 2025-05-07T20:31:48.6096036Z 2025-05-07T20:31:48.6096201Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6096472Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6096587Z module_map=module_map) 2025-05-07T20:31:48.6096776Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6096892Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6096966Z E ^ 2025-05-07T20:31:48.6097404Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6097419Z 2025-05-07T20:31:48.6097844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6097848Z 2025-05-07T20:31:48.6097951Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6098253Z self=, 2025-05-07T20:31:48.6098328Z T=1, 2025-05-07T20:31:48.6098401Z D=5120, 2025-05-07T20:31:48.6098485Z scale_ub=None, 2025-05-07T20:31:48.6098569Z contiguous=False, 2025-05-07T20:31:48.6098651Z compiled=False, 2025-05-07T20:31:48.6098723Z ) 2025-05-07T20:31:48.6098943Z self = 2025-05-07T20:31:48.6099116Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:48.6099120Z 2025-05-07T20:31:48.6099201Z @given( 2025-05-07T20:31:48.6099320Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6099421Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6099541Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6099658Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6099775Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6099847Z ) 2025-05-07T20:31:48.6100096Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6100191Z def test_silu_mul_quant( 2025-05-07T20:31:48.6100265Z self, 2025-05-07T20:31:48.6100341Z T: int, 2025-05-07T20:31:48.6100414Z D: int, 2025-05-07T20:31:48.6100512Z scale_ub: Optional[float], 2025-05-07T20:31:48.6100603Z contiguous: bool, 2025-05-07T20:31:48.6100687Z compiled: bool, 2025-05-07T20:31:48.6100765Z ) -> None: 2025-05-07T20:31:48.6100865Z torch.manual_seed(2025) 2025-05-07T20:31:48.6100936Z 2025-05-07T20:31:48.6101108Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6101296Z 2025-05-07T20:31:48.6101387Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6101511Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6101599Z x = x_sign * x_clamp 2025-05-07T20:31:48.6101676Z x0 = x[:, :D] 2025-05-07T20:31:48.6101757Z x1 = x[:, D:] 2025-05-07T20:31:48.6101827Z 2025-05-07T20:31:48.6101908Z if contiguous: 2025-05-07T20:31:48.6102000Z x0 = x0.contiguous() 2025-05-07T20:31:48.6102088Z x1 = x1.contiguous() 2025-05-07T20:31:48.6102159Z 2025-05-07T20:31:48.6102252Z if scale_ub is not None: 2025-05-07T20:31:48.6102356Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6102492Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6102567Z ) 2025-05-07T20:31:48.6102645Z else: 2025-05-07T20:31:48.6102742Z scale_ub_tensor = None 2025-05-07T20:31:48.6102817Z 2025-05-07T20:31:48.6102948Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6103044Z op = silu_mul_quant 2025-05-07T20:31:48.6103127Z if compiled: 2025-05-07T20:31:48.6103225Z op = torch.compile(op) 2025-05-07T20:31:48.6103334Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6103403Z 2025-05-07T20:31:48.6103491Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6103496Z 2025-05-07T20:31:48.6103594Z moe/activation_test.py:117: 2025-05-07T20:31:48.6103722Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6103822Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6103923Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6104512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.6104613Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6104978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6105206Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6105556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6105649Z kernel = self.compile( 2025-05-07T20:31:48.6106039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6106217Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6106342Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6106346Z 2025-05-07T20:31:48.6106562Z self = 2025-05-07T20:31:48.6107354Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6107878Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50fecb490>} 2025-05-07T20:31:48.6108641Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6108836Z context = 2025-05-07T20:31:48.6108840Z 2025-05-07T20:31:48.6109008Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6109282Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6109391Z module_map=module_map) 2025-05-07T20:31:48.6109635Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6109732Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6109810Z E ^ 2025-05-07T20:31:48.6110171Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6110175Z 2025-05-07T20:31:48.6110597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6110607Z 2025-05-07T20:31:48.6110709Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6110934Z self=, 2025-05-07T20:31:48.6111011Z T=4096, 2025-05-07T20:31:48.6111083Z D=7168, 2025-05-07T20:31:48.6111163Z scale_ub=1200.0, 2025-05-07T20:31:48.6111256Z contiguous=False, 2025-05-07T20:31:48.6111339Z compiled=False, 2025-05-07T20:31:48.6111409Z ) 2025-05-07T20:31:48.6111633Z self = 2025-05-07T20:31:48.6111814Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:48.6111818Z 2025-05-07T20:31:48.6111892Z @given( 2025-05-07T20:31:48.6112018Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6112115Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6112232Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6112350Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6112464Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6112537Z ) 2025-05-07T20:31:48.6112787Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6112880Z def test_silu_mul_quant( 2025-05-07T20:31:48.6113034Z self, 2025-05-07T20:31:48.6113110Z T: int, 2025-05-07T20:31:48.6113183Z D: int, 2025-05-07T20:31:48.6113283Z scale_ub: Optional[float], 2025-05-07T20:31:48.6113377Z contiguous: bool, 2025-05-07T20:31:48.6113463Z compiled: bool, 2025-05-07T20:31:48.6113539Z ) -> None: 2025-05-07T20:31:48.6113632Z torch.manual_seed(2025) 2025-05-07T20:31:48.6113705Z 2025-05-07T20:31:48.6113880Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6113951Z 2025-05-07T20:31:48.6114044Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6114169Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6114254Z x = x_sign * x_clamp 2025-05-07T20:31:48.6114336Z x0 = x[:, :D] 2025-05-07T20:31:48.6114414Z x1 = x[:, D:] 2025-05-07T20:31:48.6114484Z 2025-05-07T20:31:48.6114567Z if contiguous: 2025-05-07T20:31:48.6114656Z x0 = x0.contiguous() 2025-05-07T20:31:48.6114750Z x1 = x1.contiguous() 2025-05-07T20:31:48.6114824Z 2025-05-07T20:31:48.6114914Z if scale_ub is not None: 2025-05-07T20:31:48.6115021Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6115160Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6115231Z ) 2025-05-07T20:31:48.6115307Z else: 2025-05-07T20:31:48.6115398Z scale_ub_tensor = None 2025-05-07T20:31:48.6115469Z 2025-05-07T20:31:48.6115604Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6115690Z op = silu_mul_quant 2025-05-07T20:31:48.6115772Z if compiled: 2025-05-07T20:31:48.6115873Z op = torch.compile(op) 2025-05-07T20:31:48.6115978Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6116047Z 2025-05-07T20:31:48.6116140Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6116144Z 2025-05-07T20:31:48.6116240Z moe/activation_test.py:117: 2025-05-07T20:31:48.6116374Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6116475Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6116680Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6117221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:48.6117316Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6117682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6117907Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6118253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6118346Z kernel = self.compile( 2025-05-07T20:31:48.6118739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6118917Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6119043Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6119053Z 2025-05-07T20:31:48.6119259Z self = 2025-05-07T20:31:48.6120053Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6120565Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50f61c550>} 2025-05-07T20:31:48.6121404Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6121605Z context = 2025-05-07T20:31:48.6121614Z 2025-05-07T20:31:48.6121781Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6122051Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6122157Z module_map=module_map) 2025-05-07T20:31:48.6122318Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6122419Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6122492Z E ^ 2025-05-07T20:31:48.6122857Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6122861Z 2025-05-07T20:31:48.6123282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6123291Z 2025-05-07T20:31:48.6123395Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6123625Z self=, 2025-05-07T20:31:48.6123704Z T=16384, 2025-05-07T20:31:48.6123778Z D=7168, 2025-05-07T20:31:48.6123860Z scale_ub=None, 2025-05-07T20:31:48.6123942Z contiguous=True, 2025-05-07T20:31:48.6124026Z compiled=True, 2025-05-07T20:31:48.6124097Z ) 2025-05-07T20:31:48.6124317Z self = 2025-05-07T20:31:48.6124494Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:48.6124498Z 2025-05-07T20:31:48.6124571Z @given( 2025-05-07T20:31:48.6124690Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6124791Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6124906Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6125029Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6125145Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6125215Z ) 2025-05-07T20:31:48.6125550Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6125643Z def test_silu_mul_quant( 2025-05-07T20:31:48.6125717Z self, 2025-05-07T20:31:48.6125794Z T: int, 2025-05-07T20:31:48.6125867Z D: int, 2025-05-07T20:31:48.6125964Z scale_ub: Optional[float], 2025-05-07T20:31:48.6126055Z contiguous: bool, 2025-05-07T20:31:48.6126137Z compiled: bool, 2025-05-07T20:31:48.6126215Z ) -> None: 2025-05-07T20:31:48.6126312Z torch.manual_seed(2025) 2025-05-07T20:31:48.6126383Z 2025-05-07T20:31:48.6126556Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6126629Z 2025-05-07T20:31:48.6126730Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6126880Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6126988Z x = x_sign * x_clamp 2025-05-07T20:31:48.6127069Z x0 = x[:, :D] 2025-05-07T20:31:48.6127151Z x1 = x[:, D:] 2025-05-07T20:31:48.6127227Z 2025-05-07T20:31:48.6127310Z if contiguous: 2025-05-07T20:31:48.6127402Z x0 = x0.contiguous() 2025-05-07T20:31:48.6127490Z x1 = x1.contiguous() 2025-05-07T20:31:48.6127560Z 2025-05-07T20:31:48.6127653Z if scale_ub is not None: 2025-05-07T20:31:48.6127758Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6127893Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6127968Z ) 2025-05-07T20:31:48.6128040Z else: 2025-05-07T20:31:48.6128133Z scale_ub_tensor = None 2025-05-07T20:31:48.6128206Z 2025-05-07T20:31:48.6128335Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6128426Z op = silu_mul_quant 2025-05-07T20:31:48.6128510Z if compiled: 2025-05-07T20:31:48.6128713Z op = torch.compile(op) 2025-05-07T20:31:48.6128823Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6128896Z 2025-05-07T20:31:48.6128986Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6128991Z 2025-05-07T20:31:48.6129091Z moe/activation_test.py:117: 2025-05-07T20:31:48.6129219Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6129320Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6129422Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6129797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:48.6129895Z return fn(*args, **kwargs) 
2025-05-07T20:31:48.6130398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.6130493Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6130863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6131086Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6131437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6131527Z kernel = self.compile( 2025-05-07T20:31:48.6131916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6132093Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6132217Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6132221Z 2025-05-07T20:31:48.6132428Z self = 2025-05-07T20:31:48.6133227Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6133820Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50f61d360>} 2025-05-07T20:31:48.6134586Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6134779Z context = 2025-05-07T20:31:48.6134784Z 2025-05-07T20:31:48.6134950Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6135218Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6135328Z module_map=module_map) 2025-05-07T20:31:48.6135498Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6135596Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6135674Z E ^ 2025-05-07T20:31:48.6136035Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6136040Z 2025-05-07T20:31:48.6136463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6136468Z 2025-05-07T20:31:48.6136571Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6136798Z self=, 2025-05-07T20:31:48.6136875Z T=4096, 2025-05-07T20:31:48.6136950Z D=5120, 2025-05-07T20:31:48.6137029Z scale_ub=None, 2025-05-07T20:31:48.6137118Z contiguous=False, 2025-05-07T20:31:48.6137199Z compiled=True, 2025-05-07T20:31:48.6137269Z ) 2025-05-07T20:31:48.6137568Z self = 2025-05-07T20:31:48.6137743Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:48.6137753Z 2025-05-07T20:31:48.6137828Z @given( 2025-05-07T20:31:48.6137945Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6138096Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6138215Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6138333Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6138446Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6138521Z ) 2025-05-07T20:31:48.6138769Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6138862Z def test_silu_mul_quant( 2025-05-07T20:31:48.6138939Z self, 2025-05-07T20:31:48.6139014Z T: int, 2025-05-07T20:31:48.6139088Z D: int, 2025-05-07T20:31:48.6139193Z scale_ub: Optional[float], 2025-05-07T20:31:48.6139282Z contiguous: bool, 2025-05-07T20:31:48.6139374Z compiled: bool, 2025-05-07T20:31:48.6139450Z ) -> None: 2025-05-07T20:31:48.6139547Z torch.manual_seed(2025) 2025-05-07T20:31:48.6139621Z 2025-05-07T20:31:48.6139793Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6139865Z 2025-05-07T20:31:48.6139961Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6140084Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6140171Z x = x_sign * x_clamp 2025-05-07T20:31:48.6140251Z x0 = x[:, :D] 2025-05-07T20:31:48.6140327Z x1 = x[:, D:] 2025-05-07T20:31:48.6140398Z 2025-05-07T20:31:48.6140483Z if contiguous: 2025-05-07T20:31:48.6140572Z x0 = x0.contiguous() 2025-05-07T20:31:48.6140662Z x1 = x1.contiguous() 2025-05-07T20:31:48.6140732Z 2025-05-07T20:31:48.6140821Z if scale_ub is not None: 2025-05-07T20:31:48.6140932Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6141066Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6141282Z ) 2025-05-07T20:31:48.6141361Z else: 2025-05-07T20:31:48.6141455Z scale_ub_tensor = None 2025-05-07T20:31:48.6141525Z 2025-05-07T20:31:48.6141657Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6141746Z op = silu_mul_quant 2025-05-07T20:31:48.6141828Z if compiled: 2025-05-07T20:31:48.6141929Z op = torch.compile(op) 2025-05-07T20:31:48.6142033Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6142102Z 2025-05-07T20:31:48.6142196Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6142201Z 2025-05-07T20:31:48.6142296Z moe/activation_test.py:117: 2025-05-07T20:31:48.6142425Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6142528Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6142628Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6143007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:48.6143106Z return fn(*args, **kwargs) 
2025-05-07T20:31:48.6143610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.6143709Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6144072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6144301Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6144647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6144739Z kernel = self.compile( 2025-05-07T20:31:48.6145209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6145387Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6145519Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6145524Z 2025-05-07T20:31:48.6145732Z self = 2025-05-07T20:31:48.6146574Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6147091Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50f61dea0>} 2025-05-07T20:31:48.6147856Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6148052Z context = 2025-05-07T20:31:48.6148061Z 2025-05-07T20:31:48.6148227Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6148494Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6148603Z module_map=module_map) 2025-05-07T20:31:48.6148766Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6148864Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6148938Z E ^ 2025-05-07T20:31:48.6149297Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6149302Z 2025-05-07T20:31:48.6149729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6149733Z 2025-05-07T20:31:48.6149837Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6150143Z self=, 2025-05-07T20:31:48.6150218Z T=4096, 2025-05-07T20:31:48.6150290Z D=5120, 2025-05-07T20:31:48.6150373Z scale_ub=1200.0, 2025-05-07T20:31:48.6150457Z contiguous=False, 2025-05-07T20:31:48.6150538Z compiled=False, 2025-05-07T20:31:48.6150613Z ) 2025-05-07T20:31:48.6150833Z self = 2025-05-07T20:31:48.6151008Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:48.6151013Z 2025-05-07T20:31:48.6151090Z @given( 2025-05-07T20:31:48.6151210Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6151311Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6151429Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6151546Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6151664Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6151739Z ) 2025-05-07T20:31:48.6151987Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6152081Z def test_silu_mul_quant( 2025-05-07T20:31:48.6152154Z self, 2025-05-07T20:31:48.6152228Z T: int, 2025-05-07T20:31:48.6152305Z D: int, 2025-05-07T20:31:48.6152402Z scale_ub: Optional[float], 2025-05-07T20:31:48.6152489Z contiguous: bool, 2025-05-07T20:31:48.6152576Z compiled: bool, 2025-05-07T20:31:48.6152653Z ) -> None: 2025-05-07T20:31:48.6152749Z torch.manual_seed(2025) 2025-05-07T20:31:48.6152819Z 2025-05-07T20:31:48.6152990Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6153067Z 2025-05-07T20:31:48.6153155Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6153358Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6153448Z x = x_sign * x_clamp 2025-05-07T20:31:48.6153530Z x0 = x[:, :D] 2025-05-07T20:31:48.6153607Z x1 = x[:, D:] 2025-05-07T20:31:48.6153680Z 2025-05-07T20:31:48.6153761Z if contiguous: 2025-05-07T20:31:48.6153852Z x0 = x0.contiguous() 2025-05-07T20:31:48.6153941Z x1 = x1.contiguous() 2025-05-07T20:31:48.6154014Z 2025-05-07T20:31:48.6154104Z if scale_ub is not None: 2025-05-07T20:31:48.6154210Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6154344Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6154420Z ) 2025-05-07T20:31:48.6154494Z else: 2025-05-07T20:31:48.6154585Z scale_ub_tensor = None 2025-05-07T20:31:48.6154658Z 2025-05-07T20:31:48.6154788Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6154881Z op = silu_mul_quant 2025-05-07T20:31:48.6154966Z if compiled: 2025-05-07T20:31:48.6155063Z op = torch.compile(op) 2025-05-07T20:31:48.6155173Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6155245Z 2025-05-07T20:31:48.6155334Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6155338Z 2025-05-07T20:31:48.6155438Z moe/activation_test.py:117: 2025-05-07T20:31:48.6155768Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6155921Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6156054Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6156592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:48.6156701Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6157083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6157312Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6157665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6157920Z kernel = self.compile( 2025-05-07T20:31:48.6158312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6158491Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6158613Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6158618Z 2025-05-07T20:31:48.6158823Z self = 2025-05-07T20:31:48.6159623Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6160138Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50f61e680>} 2025-05-07T20:31:48.6160909Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6161104Z context = 2025-05-07T20:31:48.6161108Z 2025-05-07T20:31:48.6161275Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6161542Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6161647Z module_map=module_map) 2025-05-07T20:31:48.6161811Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6162043Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6162121Z E ^ 2025-05-07T20:31:48.6162484Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6162495Z 2025-05-07T20:31:48.6162924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6162930Z 2025-05-07T20:31:48.6163037Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6163264Z self=, 2025-05-07T20:31:48.6163340Z T=4096, 2025-05-07T20:31:48.6163417Z D=5120, 2025-05-07T20:31:48.6163497Z scale_ub=1200.0, 2025-05-07T20:31:48.6163582Z contiguous=False, 2025-05-07T20:31:48.6163666Z compiled=True, 2025-05-07T20:31:48.6163737Z ) 2025-05-07T20:31:48.6163960Z self = 2025-05-07T20:31:48.6164137Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:48.6164142Z 2025-05-07T20:31:48.6164216Z @given( 2025-05-07T20:31:48.6164337Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6164438Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6164554Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6164674Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6164787Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6164861Z ) 2025-05-07T20:31:48.6165108Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6165202Z def test_silu_mul_quant( 2025-05-07T20:31:48.6165279Z self, 2025-05-07T20:31:48.6165354Z T: int, 2025-05-07T20:31:48.6165434Z D: int, 2025-05-07T20:31:48.6165534Z scale_ub: Optional[float], 2025-05-07T20:31:48.6165620Z contiguous: bool, 2025-05-07T20:31:48.6165715Z compiled: bool, 2025-05-07T20:31:48.6165812Z ) -> None: 2025-05-07T20:31:48.6165915Z torch.manual_seed(2025) 2025-05-07T20:31:48.6166001Z 2025-05-07T20:31:48.6166258Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6166329Z 2025-05-07T20:31:48.6166419Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6166545Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6166631Z x = x_sign * x_clamp 2025-05-07T20:31:48.6166712Z x0 = x[:, :D] 2025-05-07T20:31:48.6166789Z x1 = x[:, D:] 2025-05-07T20:31:48.6166858Z 2025-05-07T20:31:48.6166944Z if contiguous: 2025-05-07T20:31:48.6167033Z x0 = x0.contiguous() 2025-05-07T20:31:48.6167119Z x1 = x1.contiguous() 2025-05-07T20:31:48.6167193Z 2025-05-07T20:31:48.6167283Z if scale_ub is not None: 2025-05-07T20:31:48.6167388Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6167531Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6167603Z ) 2025-05-07T20:31:48.6167677Z else: 2025-05-07T20:31:48.6167771Z scale_ub_tensor = None 2025-05-07T20:31:48.6167846Z 2025-05-07T20:31:48.6167979Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6168069Z op = silu_mul_quant 2025-05-07T20:31:48.6168152Z if compiled: 2025-05-07T20:31:48.6168251Z op = torch.compile(op) 2025-05-07T20:31:48.6168356Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6168425Z 2025-05-07T20:31:48.6168515Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6168519Z 2025-05-07T20:31:48.6168615Z moe/activation_test.py:117: 2025-05-07T20:31:48.6168740Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6168844Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6168942Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6169398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:48.6169494Z return fn(*args, **kwargs) 
2025-05-07T20:31:48.6170003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.6170103Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6170465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6170687Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6171034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6171125Z kernel = self.compile( 2025-05-07T20:31:48.6171515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6171694Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6171819Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6171828Z 2025-05-07T20:31:48.6172039Z self = 2025-05-07T20:31:48.6172829Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6173340Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50f61fac0>} 2025-05-07T20:31:48.6174099Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6174295Z context = 2025-05-07T20:31:48.6174303Z 2025-05-07T20:31:48.6174469Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6174816Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6174926Z module_map=module_map) 2025-05-07T20:31:48.6175088Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6175184Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6175260Z E ^ 2025-05-07T20:31:48.6175618Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6175623Z 2025-05-07T20:31:48.6176046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6176051Z 2025-05-07T20:31:48.6176158Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6176407Z self=, 2025-05-07T20:31:48.6176498Z T=2048, 2025-05-07T20:31:48.6176590Z D=7168, 2025-05-07T20:31:48.6176675Z scale_ub=1200.0, 2025-05-07T20:31:48.6176761Z contiguous=False, 2025-05-07T20:31:48.6176844Z compiled=False, 2025-05-07T20:31:48.6176917Z ) 2025-05-07T20:31:48.6177142Z self = 2025-05-07T20:31:48.6177316Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:48.6177320Z 2025-05-07T20:31:48.6177399Z @given( 2025-05-07T20:31:48.6177516Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6177613Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6177731Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6177847Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6178095Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6178172Z ) 2025-05-07T20:31:48.6178419Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6178521Z def test_silu_mul_quant( 2025-05-07T20:31:48.6178595Z self, 2025-05-07T20:31:48.6178669Z T: int, 2025-05-07T20:31:48.6178744Z D: int, 2025-05-07T20:31:48.6178840Z scale_ub: Optional[float], 2025-05-07T20:31:48.6178927Z contiguous: bool, 2025-05-07T20:31:48.6179012Z compiled: bool, 2025-05-07T20:31:48.6179089Z ) -> None: 2025-05-07T20:31:48.6179182Z torch.manual_seed(2025) 2025-05-07T20:31:48.6179257Z 2025-05-07T20:31:48.6179426Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6179497Z 2025-05-07T20:31:48.6179590Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6179714Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6179804Z x = x_sign * x_clamp 2025-05-07T20:31:48.6179888Z x0 = x[:, :D] 2025-05-07T20:31:48.6179964Z x1 = x[:, D:] 2025-05-07T20:31:48.6180040Z 2025-05-07T20:31:48.6180125Z if contiguous: 2025-05-07T20:31:48.6180216Z x0 = x0.contiguous() 2025-05-07T20:31:48.6180303Z x1 = x1.contiguous() 2025-05-07T20:31:48.6180374Z 2025-05-07T20:31:48.6180463Z if scale_ub is not None: 2025-05-07T20:31:48.6180571Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6180704Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6180777Z ) 2025-05-07T20:31:48.6180855Z else: 2025-05-07T20:31:48.6180947Z scale_ub_tensor = None 2025-05-07T20:31:48.6181017Z 2025-05-07T20:31:48.6181149Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6181238Z op = silu_mul_quant 2025-05-07T20:31:48.6181326Z if compiled: 2025-05-07T20:31:48.6181428Z op = torch.compile(op) 2025-05-07T20:31:48.6181534Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6181607Z 2025-05-07T20:31:48.6181694Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6181781Z 2025-05-07T20:31:48.6181878Z moe/activation_test.py:117: 2025-05-07T20:31:48.6182008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6182108Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6182207Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6182718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:48.6182814Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6183186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6183409Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6183759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6183854Z kernel = self.compile( 2025-05-07T20:31:48.6184250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6184425Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6184549Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6184554Z 2025-05-07T20:31:48.6184758Z self = 2025-05-07T20:31:48.6185550Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6186135Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc5144ce200>} 2025-05-07T20:31:48.6186952Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6187150Z context = 2025-05-07T20:31:48.6187155Z 2025-05-07T20:31:48.6187321Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6187588Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6187695Z module_map=module_map) 2025-05-07T20:31:48.6187857Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6187954Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6188027Z E ^ 2025-05-07T20:31:48.6188392Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:48.6188818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
[Hypothesis tried ten further examples, each failing with the identical source listing, traceback, and CompilationError as the example above; only the tried parameters are kept below.]
2025-05-07T20:31:48.6188933Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:31:48.6205937Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:48.6219314Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:48.6232045Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:48.6245271Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:48.6258917Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:48.6271801Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:48.6285165Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:48.6298573Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:48.6311891Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
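[Analysis, not part of the runner output.] The root cause is architectural, not numerical: "fp8e4nv" is Triton's name for the float8 E4M3 format (torch.float8_e4m3fn), which Triton only supports on NVIDIA GPUs with compute capability >= 8.9 (Ada/Hopper). This job's linux.g5.4xlarge.nvidia.gpu runner carries an A10G (sm_86), where Triton offers only 'fp8e4b15' and 'fp8e5', so every Hypothesis example dies at kernel-compile time inside make_ir before any tensors are touched. A minimal sketch of a device-capability gate that would skip these cases on pre-sm_89 GPUs follows; the helper name and the skip placement are hypothetical and not part of activation_test.py:

    import unittest

    import torch


    def cuda_supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton's fp8e4nv (E4M3) codegen requires an
        # NVIDIA GPU with compute capability >= 8.9; the A10G here is sm_86.
        if not torch.cuda.is_available():
            return False
        # get_device_capability() returns a (major, minor) tuple.
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipIf(
        not cuda_supports_fp8e4nv(),
        "fp8e4nv (float8_e4m3fn) is unsupported on this GPU architecture",
    )
    class ActivationTests(unittest.TestCase):
        ...  # test_silu_mul_quant would then run only on sm_89+ devices

Gating at the class (or module) level would also keep Hypothesis from spending its max_examples budget rediscovering the same compile-time failure, which is what inflates this log.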
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6329303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
[The next eight Hypothesis examples fail identically: each raises triton.compiler.errors.CompilationError from _fbgemm_silu_mul_quant (moe/activation_test.py:117 -> triton/compiler/compiler.py:100) with ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); the test source and traceback repeat the listing above verbatim.]
2025-05-07T20:31:48.6329411Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:48.6342691Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:48.6356050Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:48.6369993Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:31:48.6382786Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:48.6395446Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:48.6408646Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:48.6421896Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
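All of the fp8e4nv failures above trace to the same hardware constraint: Triton compiles the fp8e4nv dtype (the e4m3 format behind torch.float8_e4m3fn) only for NVIDIA GPUs of compute capability 8.9 or newer; older architectures expose just fp8e4b15 and fp8e5, the exact pair listed in the ValueError. A minimal sketch of a capability guard that would skip these examples rather than fail them; the helper name and decorator placement are illustrative assumptions, not FBGEMM's actual test code:

import unittest

import torch

def cuda_supports_fp8e4nv() -> bool:
    # Hypothetical helper: fp8e4nv (float8 e4m3) kernels compile only on
    # SM 8.9+ (Ada/Hopper); anything older hits the CompilationError above.
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)

@unittest.skipUnless(cuda_supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
class ActivationTests(unittest.TestCase):
    ...  # test_silu_mul_quant and friends as defined in moe/activation_test.py

Guarded this way, Hypothesis would record a single skip instead of re-deriving the identical CompilationError for every drawn (T, D, scale_ub, contiguous, compiled) combination.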
2025-05-07T20:31:48.6435036Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:48.6438493Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:48.6440454Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:48.6440585Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:31:48.6440692Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:48.6444150Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:48.6446008Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:48.6446147Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:31:48.6446257Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:48.6454027Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:48.6456312Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:48.6456453Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:31:48.6456562Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:48.6460236Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:48.6462200Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:48.6462335Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:31:48.6462442Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:31:48.6465794Z > x_sign = torch.sign(x)
2025-05-07T20:31:48.6467633Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:48.6467766Z moe/activation_test.py:94: OutOfMemoryError
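The OutOfMemoryError examples above are a capacity problem rather than a kernel bug: one input x of shape [T, 2 * D] in bfloat16 at T=16384, D=7168 is 16384 × 14336 × 2 bytes = 448 MiB, exactly the allocation that fails, and by this point earlier examples have left nearly all of the 22.07 GiB card in use. A short sketch of the mitigation the error text itself suggests, plus an explicit cleanup between examples; the helper is an illustrative assumption, not part of the test file:

import gc
import os

# The allocator hint from the error message; it must be set before the
# process makes its first CUDA allocation to have any effect.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

def release_cuda_cache() -> None:
    # Hypothetical per-example cleanup: drop dead Python references, then
    # return cached allocator blocks to the driver.
    gc.collect()
    torch.cuda.empty_cache()

Calling release_cuda_cache() from a teardown hook can keep a 448 MiB draw from failing when memory is held only by dead references or the caching allocator; it cannot help if live tensors from the current example genuinely exceed the card.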
Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

self = <...>, T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
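These fp8e4nv failures are an architecture limit, not a flake: Triton rejects the fp8e4nv (float8_e4m3fn) dtype at kernel-compile time on this GPU and offers only fp8e4b15 and fp8e5. A hedged guard sketch for skipping such cases; the (8, 9) capability threshold is an assumption (fp8e4nv codegen generally targets Ada/Hopper-class parts), and the class name is illustrative:

    # Sketch: skip FP8 cases on GPUs whose Triton backend rejects fp8e4nv.
    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)  # assumed threshold

    @unittest.skipUnless(supports_fp8e4nv(), "Triton fp8e4nv unsupported on this GPU")
    class Fp8ActivationTests(unittest.TestCase):
        ...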
Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)

self = <...>, T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False

    [@given/@settings and test body as in the listing above]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117: (traceback identical to the one above: activation_test.py:115 in fn -> fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 in silu_mul_quant -> triton jit.py:330/623 -> compiler.py:273 make_ir)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)

self = <...>, T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False

    [@given/@settings and test body as in the listing above]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117: (traceback identical to the one above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

self = <...>, T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False

    [@given/@settings and test body as in the listing above]
        torch.manual_seed(2025)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0: 30.44 MiB free of 22.07 GiB; 21.70 GiB allocated by PyTorch, 53.93 MiB reserved but unallocated. (allocator hint and doc link as above)

moe/activation_test.py:92: OutOfMemoryError
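All of the Triton compilation failures above originate in silu_mul_quant before the kernel ever runs. For orientation, a plain-PyTorch sketch of the semantics the name and call signature suggest (SiLU(x0) * x1, then FP8 quantization with an optional scale upper bound); this is an assumption for illustration, not the FBGEMM kernel's actual definition:

    # Sketch of presumed semantics; the real op launches the Triton kernel
    # _fbgemm_silu_mul_quant seen in the traceback above.
    from typing import Optional, Tuple

    import torch
    import torch.nn.functional as F

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        y = F.silu(x0.float()) * x1.float()      # gated SiLU activation in fp32
        amax = y.abs().amax()
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub.float().squeeze())
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        y_scale = amax / fp8_max                 # tensor-wise scale (assumed)
        y_fp8 = (y / y_scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, y_scale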
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)

self = <...>, T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    [@given/@settings and test body as in the listing above]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117: (traceback identical to the one above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)

self = <...>, T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False

    [@given/@settings and test body as in the listing above]
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0: 30.44 MiB free of 22.07 GiB; 21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated. (allocator hint and doc link as above)

moe/activation_test.py:94: OutOfMemoryError
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)

self = <...>, T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False

    [@given/@settings and test body as in the listing above]
        torch.manual_seed(2025)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0: 30.44 MiB free of 22.07 GiB; 21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated. (allocator hint and doc link as above)

moe/activation_test.py:92: OutOfMemoryError
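A quick cross-check: the first allocation in the test is the [T, 2 * D] bfloat16 input, i.e. T * 2D elements at 2 bytes each, and every "Tried to allocate" size in this log matches that tensor exactly:

    # Cross-check: requested bytes = T * (2 * D) elements * 2 bytes (bfloat16).
    def alloc_mib(T: int, D: int) -> float:
        return T * 2 * D * 2 / 2**20

    assert alloc_mib(2048, 7168) == 56.0    # "Tried to allocate 56.00 MiB"
    assert alloc_mib(2048, 5120) == 40.0    # 40.00 MiB
    assert alloc_mib(4096, 5120) == 80.0    # 80.00 MiB
    assert alloc_mib(4096, 7168) == 112.0   # 112.00 MiB
    assert alloc_mib(16384, 5120) == 320.0  # 320.00 MiB
    assert alloc_mib(16384, 7168) == 448.0  # 448.00 MiB

So no single request is oversized; with only ~30 MiB free on a 22.07 GiB device, even the smallest example's input cannot be placed, which points at memory retained across earlier examples rather than at any one shape.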
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. (30.44 MiB free; 21.73 GiB allocated by PyTorch; full report as above)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. (30.44 MiB free; full report as above)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. (30.44 MiB free; full report as above)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. (30.44 MiB free; full report as above)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. (30.44 MiB free; full report as above)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. (30.44 MiB free; full report as above)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. (30.44 MiB free; full report as above)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. (30.44 MiB free; full report as above)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. (30.44 MiB free; full report as above)
moe/activation_test.py:92: OutOfMemoryError
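Hypothesis replays all examples in one process, so once the device fills up every later draw fails on entry (the free figure stays pinned at 30.44 MiB through the rest of the run). A mitigation sketch, as a local test change this log does not contain, releasing cached blocks between examples:

    # Sketch: release cached CUDA blocks between Hypothesis examples.
    import gc

    import torch

    def release_cuda_memory() -> None:
        gc.collect()              # drop dead Python references to tensors
        torch.cuda.synchronize()  # let in-flight kernels finish first
        torch.cuda.empty_cache()  # return cached blocks to the driver

Calling this at the top of test_silu_mul_quant, or from a fixture wrapping each example, keeps one example's tensors from starving the next.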
Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)

self = <...>, T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    [@given/@settings and test body as in the listing above]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117: (traceback identical to the one above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
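One variant below differs: with compiled=True the traceback passes through torch/_dynamo/eval_frame.py, yet ends in the same Triton error. torch.compile only wraps the Python callable; the FBGEMM op JIT-compiles its Triton kernel at first launch either way, so compiled and eager examples fail identically here. A minimal illustration (the import path is the one shown in the tracebacks; the wrapper is illustrative):

    # Sketch: compiled and eager calls reach the same Triton JIT compile.
    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    def fn(x0: torch.Tensor, x1: torch.Tensor, scale_ub: torch.Tensor):
        return silu_mul_quant(x0, x1, scale_ub)

    compiled_fn = torch.compile(fn)  # only adds _dynamo frames to the traceback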
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)

self = <...>, T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False

    [@given/@settings and test body as in the listing above]
        torch.manual_seed(2025)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0: 30.44 MiB free of 22.07 GiB; 21.74 GiB allocated by PyTorch, 5.24 MiB reserved but unallocated. (allocator hint and doc link as above)

moe/activation_test.py:92: OutOfMemoryError
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.6605575Z 2025-05-07T20:31:48.6605695Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.6605776Z 2025-05-07T20:31:48.6605879Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6606105Z self=, 2025-05-07T20:31:48.6606187Z T=128, 2025-05-07T20:31:48.6606260Z D=7168, 2025-05-07T20:31:48.6606339Z scale_ub=1200.0, 2025-05-07T20:31:48.6606434Z contiguous=True, 2025-05-07T20:31:48.6606529Z compiled=True, 2025-05-07T20:31:48.6606601Z ) 2025-05-07T20:31:48.6606844Z self = 2025-05-07T20:31:48.6607011Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:48.6607015Z 2025-05-07T20:31:48.6607091Z @given( 2025-05-07T20:31:48.6607208Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6607304Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6607420Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6607539Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6607653Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6607726Z ) 2025-05-07T20:31:48.6607974Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6608072Z def test_silu_mul_quant( 2025-05-07T20:31:48.6608146Z self, 2025-05-07T20:31:48.6608220Z T: int, 2025-05-07T20:31:48.6608293Z D: int, 2025-05-07T20:31:48.6608394Z scale_ub: Optional[float], 2025-05-07T20:31:48.6608482Z contiguous: bool, 2025-05-07T20:31:48.6608568Z compiled: bool, 2025-05-07T20:31:48.6608643Z ) -> None: 2025-05-07T20:31:48.6608735Z torch.manual_seed(2025) 2025-05-07T20:31:48.6608808Z 2025-05-07T20:31:48.6608977Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6609049Z 2025-05-07T20:31:48.6609143Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6609272Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6609358Z x = x_sign * x_clamp 2025-05-07T20:31:48.6609437Z x0 = x[:, :D] 2025-05-07T20:31:48.6609514Z x1 = x[:, D:] 2025-05-07T20:31:48.6609690Z 2025-05-07T20:31:48.6609773Z if contiguous: 2025-05-07T20:31:48.6609865Z x0 = x0.contiguous() 2025-05-07T20:31:48.6609955Z x1 = x1.contiguous() 2025-05-07T20:31:48.6610026Z 2025-05-07T20:31:48.6610116Z if scale_ub is not None: 2025-05-07T20:31:48.6610222Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6610358Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6610428Z ) 2025-05-07T20:31:48.6610506Z else: 2025-05-07T20:31:48.6610597Z scale_ub_tensor = None 2025-05-07T20:31:48.6610665Z 2025-05-07T20:31:48.6610800Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6610887Z op = silu_mul_quant 2025-05-07T20:31:48.6610969Z if compiled: 2025-05-07T20:31:48.6611074Z op = torch.compile(op) 2025-05-07T20:31:48.6611179Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6611260Z 2025-05-07T20:31:48.6611349Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6611353Z 2025-05-07T20:31:48.6611450Z moe/activation_test.py:117: 2025-05-07T20:31:48.6611584Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6611683Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6611782Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6612162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:48.6612253Z return fn(*args, **kwargs) 
2025-05-07T20:31:48.6612758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.6612858Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6613302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6613530Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6613881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6613975Z kernel = self.compile( 2025-05-07T20:31:48.6614372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6614549Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6614674Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6614678Z 2025-05-07T20:31:48.6614887Z self = 2025-05-07T20:31:48.6615688Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6616208Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50e6cf0a0>} 2025-05-07T20:31:48.6616971Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6617168Z context = 2025-05-07T20:31:48.6617173Z 2025-05-07T20:31:48.6617338Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6617608Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6617722Z module_map=module_map) 2025-05-07T20:31:48.6617887Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6617988Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6618200Z E ^ 2025-05-07T20:31:48.6618564Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6618568Z 2025-05-07T20:31:48.6618993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6618998Z 2025-05-07T20:31:48.6619101Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6619331Z self=, 2025-05-07T20:31:48.6619405Z T=128, 2025-05-07T20:31:48.6619479Z D=7168, 2025-05-07T20:31:48.6619562Z scale_ub=1200.0, 2025-05-07T20:31:48.6619646Z contiguous=True, 2025-05-07T20:31:48.6619727Z compiled=False, 2025-05-07T20:31:48.6619804Z ) 2025-05-07T20:31:48.6620034Z self = 2025-05-07T20:31:48.6620205Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:48.6620215Z 2025-05-07T20:31:48.6620291Z @given( 2025-05-07T20:31:48.6620410Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6620513Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6620626Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6620747Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6620865Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6620936Z ) 2025-05-07T20:31:48.6621186Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6621283Z def test_silu_mul_quant( 2025-05-07T20:31:48.6621359Z self, 2025-05-07T20:31:48.6621435Z T: int, 2025-05-07T20:31:48.6621512Z D: int, 2025-05-07T20:31:48.6621687Z scale_ub: Optional[float], 2025-05-07T20:31:48.6621776Z contiguous: bool, 2025-05-07T20:31:48.6621863Z compiled: bool, 2025-05-07T20:31:48.6621939Z ) -> None: 2025-05-07T20:31:48.6622041Z torch.manual_seed(2025) 2025-05-07T20:31:48.6622111Z 2025-05-07T20:31:48.6622282Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6622359Z 2025-05-07T20:31:48.6622449Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6622575Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6624422Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.6624428Z 2025-05-07T20:31:48.6624549Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:48.6624553Z 2025-05-07T20:31:48.6624659Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6624885Z self=, 2025-05-07T20:31:48.6624956Z T=128, 2025-05-07T20:31:48.6625032Z D=5120, 2025-05-07T20:31:48.6625112Z scale_ub=1200.0, 2025-05-07T20:31:48.6625197Z contiguous=True, 2025-05-07T20:31:48.6625277Z compiled=True, 2025-05-07T20:31:48.6625346Z ) 2025-05-07T20:31:48.6625566Z self = 2025-05-07T20:31:48.6625733Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:48.6625737Z 2025-05-07T20:31:48.6625809Z @given( 2025-05-07T20:31:48.6625936Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6626032Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6626145Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6626353Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6626485Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6626568Z ) 2025-05-07T20:31:48.6626830Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6626924Z def test_silu_mul_quant( 2025-05-07T20:31:48.6627000Z self, 2025-05-07T20:31:48.6627071Z T: int, 2025-05-07T20:31:48.6627145Z D: int, 2025-05-07T20:31:48.6627243Z scale_ub: Optional[float], 2025-05-07T20:31:48.6627330Z contiguous: bool, 2025-05-07T20:31:48.6627413Z compiled: bool, 2025-05-07T20:31:48.6627490Z ) -> None: 2025-05-07T20:31:48.6627583Z torch.manual_seed(2025) 2025-05-07T20:31:48.6627652Z 2025-05-07T20:31:48.6627828Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6627899Z 2025-05-07T20:31:48.6627991Z > x_sign = torch.sign(x) 2025-05-07T20:31:48.6629829Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.6629834Z 2025-05-07T20:31:48.6629958Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:48.6629962Z 2025-05-07T20:31:48.6630064Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6630373Z self=, 2025-05-07T20:31:48.6630453Z T=128, 2025-05-07T20:31:48.6630524Z D=7168, 2025-05-07T20:31:48.6630605Z scale_ub=None, 2025-05-07T20:31:48.6630689Z contiguous=True, 2025-05-07T20:31:48.6630769Z compiled=True, 2025-05-07T20:31:48.6630838Z ) 2025-05-07T20:31:48.6631059Z self = 2025-05-07T20:31:48.6631224Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:48.6631229Z 2025-05-07T20:31:48.6631305Z @given( 2025-05-07T20:31:48.6631422Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6631517Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6631638Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6631755Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6631866Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6631941Z ) 2025-05-07T20:31:48.6632191Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6632283Z def test_silu_mul_quant( 2025-05-07T20:31:48.6632366Z self, 2025-05-07T20:31:48.6632439Z T: int, 2025-05-07T20:31:48.6632516Z D: int, 2025-05-07T20:31:48.6632611Z scale_ub: Optional[float], 2025-05-07T20:31:48.6632696Z contiguous: bool, 2025-05-07T20:31:48.6632784Z compiled: bool, 2025-05-07T20:31:48.6632859Z ) -> None: 2025-05-07T20:31:48.6632952Z torch.manual_seed(2025) 2025-05-07T20:31:48.6633026Z 2025-05-07T20:31:48.6633193Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6635024Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.6635111Z 2025-05-07T20:31:48.6635229Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.6635360Z =============================== warnings summary =============================== 2025-05-07T20:31:48.6635677Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:48.6635983Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:48.6636292Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:48.6637246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:31:48.6637481Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:31:48.6637489Z 2025-05-07T20:31:48.6637667Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:31:48.6638967Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:31:48.6639159Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:31:48.6639164Z 2025-05-07T20:31:48.6639471Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:31:48.6639639Z ================== 1 failed, 1 passed, 13 warnings in 29.66s =================== 2025-05-07T20:31:50.3886825Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:31:50.4503895Z 2025-05-07T20:31:50.4504467Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:31:50.4504842Z 2025-05-07T20:31:50.4504846Z 2025-05-07T20:31:50.4525339Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:31:52.6080002Z ============================= test session starts ============================== 2025-05-07T20:31:52.6080676Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:52.6081216Z cachedir: .pytest_cache 2025-05-07T20:31:52.6081810Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:52.6082556Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:52.6082977Z plugins: hypothesis-6.131.14 2025-05-07T20:31:54.2148914Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:54.3936282Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:31:54.3936696Z run-last-failure: rerun previous 1 failure 2025-05-07T20:31:54.3936928Z 2025-05-07T20:31:56.5185695Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:56.5186841Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:31:56.5188218Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:56.5190137Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:56.5191554Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:56.5192974Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.5194308Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:56.5195717Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.5197166Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:56.5198489Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:31:56.5199886Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:56.5201130Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:31:56.5202181Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:56.5203220Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:31:56.5204465Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:56.5205781Z W0507 20:31:56.517000 87377 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:56.5206924Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:56.5208005Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:31:56.5209244Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:56.5210628Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:56.5211713Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.5212638Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:56.5213482Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:31:56.5214526Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:56.5354689Z W0507 20:31:56.534000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:56.5355975Z W0507 20:31:56.534000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:31:56.5357366Z W0507 20:31:56.534000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:56.5358900Z W0507 20:31:56.534000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:56.5360310Z W0507 20:31:56.534000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:56.5361725Z W0507 20:31:56.534000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.5363305Z W0507 20:31:56.534000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:56.5364723Z W0507 20:31:56.534000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.5366166Z W0507 20:31:56.534000 87377 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] [... remaining frames identical to the traceback above, elided ...] 2025-05-07T20:31:56.5380672Z W0507 20:31:56.534000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.5381604Z W0507 20:31:56.534000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:56.5382357Z W0507 20:31:56.534000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:31:56.5383390Z W0507 20:31:56.534000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.1021847Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.1022564Z self=, 2025-05-07T20:31:57.1022986Z T=1, 2025-05-07T20:31:57.1023192Z D=5120, 2025-05-07T20:31:57.1023401Z scale_ub=None, 2025-05-07T20:31:57.1023625Z contiguous=True, 2025-05-07T20:31:57.1024257Z compiled=True, 2025-05-07T20:31:57.1024485Z ) 2025-05-07T20:31:57.1024821Z self = 2025-05-07T20:31:57.1025340Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:57.1025616Z 2025-05-07T20:31:57.1025707Z @given( 2025-05-07T20:31:57.1025950Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.1026282Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.1026603Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.1026950Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.1027287Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.1027588Z ) 2025-05-07T20:31:57.1027955Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.1028413Z def test_silu_mul_quant( 2025-05-07T20:31:57.1028670Z self, 2025-05-07T20:31:57.1028903Z T: int, 2025-05-07T20:31:57.1029132Z D: int, 2025-05-07T20:31:57.1029364Z scale_ub: Optional[float], 2025-05-07T20:31:57.1029650Z contiguous: bool, 2025-05-07T20:31:57.1029904Z compiled: bool, 2025-05-07T20:31:57.1030149Z ) -> None: 2025-05-07T20:31:57.1030377Z torch.manual_seed(2025) 2025-05-07T20:31:57.1030625Z 2025-05-07T20:31:57.1030913Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.1031271Z 2025-05-07T20:31:57.1031483Z x_sign = torch.sign(x) 2025-05-07T20:31:57.1031788Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.1032111Z x = x_sign * x_clamp 2025-05-07T20:31:57.1032359Z x0 = x[:, :D] 2025-05-07T20:31:57.1032580Z x1 = x[:, D:] 2025-05-07T20:31:57.1032800Z 2025-05-07T20:31:57.1032998Z if contiguous: 2025-05-07T20:31:57.1033238Z x0 = x0.contiguous() 2025-05-07T20:31:57.1033509Z x1 = x1.contiguous() 2025-05-07T20:31:57.1033766Z 2025-05-07T20:31:57.1033964Z if scale_ub is not None: 2025-05-07T20:31:57.1034251Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.1034767Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.1035082Z ) 2025-05-07T20:31:57.1035288Z else: 2025-05-07T20:31:57.1035510Z scale_ub_tensor = None 2025-05-07T20:31:57.1035770Z 2025-05-07T20:31:57.1036020Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.1036347Z op = silu_mul_quant 2025-05-07T20:31:57.1036607Z if compiled: 2025-05-07T20:31:57.1036858Z op = torch.compile(op) 2025-05-07T20:31:57.1037169Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.1037457Z 2025-05-07T20:31:57.1037653Z y_fp8, y_scale = fn() 2025-05-07T20:31:57.1037957Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:57.1038257Z 2025-05-07T20:31:57.1038507Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.1038856Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:57.1039161Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:57.1039489Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:57.1039859Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:57.1040182Z 2025-05-07T20:31:57.1040388Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:57.1040597Z 2025-05-07T20:31:57.1040703Z moe/activation_test.py:126: 2025-05-07T20:31:57.1041009Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.1041354Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:57.1041690Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:57.1042501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:57.1043360Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:57.1043925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.1044627Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.1045335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:57.1046076Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:57.1046841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:57.1047606Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:57.1048354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:57.1049065Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:57.1049674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:57.1050212Z fn() 2025-05-07T20:31:57.1050736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:57.1051329Z self.fn.run( 2025-05-07T20:31:57.1051804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.1052352Z kernel = self.compile( 2025-05-07T20:31:57.1052910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.1053574Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.1053979Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.1054214Z 2025-05-07T20:31:57.1054436Z self = 2025-05-07T20:31:57.1056103Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.1058380Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7facc8283400>} 2025-05-07T20:31:57.1060121Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.1061172Z context = 2025-05-07T20:31:57.1061475Z 2025-05-07T20:31:57.1061646Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.1062188Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.1062663Z module_map=module_map) 2025-05-07T20:31:57.1063045Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.1063417Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:57.1063692Z E ^ 2025-05-07T20:31:57.1064164Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.1064630Z 2025-05-07T20:31:57.1065054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.1065574Z 2025-05-07T20:31:57.1065686Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.1066103Z self=, 2025-05-07T20:31:57.1066510Z T=2048, 2025-05-07T20:31:57.1066703Z D=5120, 2025-05-07T20:31:57.1067030Z scale_ub=1200.0, 2025-05-07T20:31:57.1067256Z contiguous=True, 2025-05-07T20:31:57.1067497Z compiled=False, 2025-05-07T20:31:57.1067706Z ) 2025-05-07T20:31:58.0251254Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:58.0252391Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:31:58.0253772Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:58.0255263Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:58.0256900Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:58.0258381Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.0259771Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:58.0261182Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.0262629Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:58.0264254Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:31:58.0265499Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:58.0266735Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:31:58.0267796Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:58.0268830Z W0507 20:31:58.021000 87377 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] [... remaining frames identical to the traceback above, elided ...] 2025-05-07T20:31:58.0277396Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.0278327Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.0279125Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:31:58.0280175Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.2317956Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:58.2319054Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:31:58.2320420Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:58.2321879Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:58.2323302Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:58.2324862Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.2326198Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:58.2327600Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.2329108Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:58.2330391Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:31:58.2331646Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:58.2332880Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:31:58.2333933Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:58.2335095Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:31:58.2336340Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:58.2337655Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:58.2338883Z W0507 
20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:58.2339993Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:31:58.2341203Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:58.2342598Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:58.2343684Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.2344620Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.2345371Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:31:58.2346416Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.0042402Z self = 2025-05-07T20:31:59.0043163Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:59.0043462Z 2025-05-07T20:31:59.0043548Z @given( 2025-05-07T20:31:59.0043803Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:59.0044129Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:59.0044447Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:59.0044795Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:59.0045136Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:59.0045427Z ) 2025-05-07T20:31:59.0045793Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:59.0046253Z def test_silu_mul_quant( 2025-05-07T20:31:59.0046499Z self, 2025-05-07T20:31:59.0046706Z T: int, 2025-05-07T20:31:59.0046916Z D: int, 2025-05-07T20:31:59.0047139Z scale_ub: Optional[float], 2025-05-07T20:31:59.0047423Z contiguous: bool, 2025-05-07T20:31:59.0047678Z compiled: bool, 2025-05-07T20:31:59.0047909Z ) -> None: 2025-05-07T20:31:59.0048143Z torch.manual_seed(2025) 2025-05-07T20:31:59.0048392Z 2025-05-07T20:31:59.0048672Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:59.0049029Z 2025-05-07T20:31:59.0049234Z x_sign = torch.sign(x) 2025-05-07T20:31:59.0049539Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:59.0049862Z x = x_sign * x_clamp 2025-05-07T20:31:59.0050111Z x0 = x[:, :D] 2025-05-07T20:31:59.0050338Z x1 = x[:, D:] 2025-05-07T20:31:59.0050549Z 2025-05-07T20:31:59.0050745Z if contiguous: 2025-05-07T20:31:59.0050988Z x0 = x0.contiguous() 2025-05-07T20:31:59.0051253Z x1 = x1.contiguous() 2025-05-07T20:31:59.0051622Z 2025-05-07T20:31:59.0051827Z if scale_ub is not None: 2025-05-07T20:31:59.0052109Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:59.0052470Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:59.0052789Z ) 2025-05-07T20:31:59.0052990Z else: 2025-05-07T20:31:59.0053212Z scale_ub_tensor = None 
2025-05-07T20:31:59.0053480Z 2025-05-07T20:31:59.0053720Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.0054047Z op = silu_mul_quant 2025-05-07T20:31:59.0054307Z if compiled: 2025-05-07T20:31:59.0054561Z op = torch.compile(op) 2025-05-07T20:31:59.0054911Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.0055191Z 2025-05-07T20:31:59.0055395Z > y_fp8, y_scale = fn() 2025-05-07T20:31:59.0055727Z 2025-05-07T20:31:59.0055843Z moe/activation_test.py:117: 2025-05-07T20:31:59.0056149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.0056491Z moe/activation_test.py:115: in fn 2025-05-07T20:31:59.0056785Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.0057508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:59.0058273Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:59.0058829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:59.0059582Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:59.0060257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:59.0060806Z kernel = self.compile( 2025-05-07T20:31:59.0061366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:59.0062043Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.0062447Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.0062813Z 2025-05-07T20:31:59.0063029Z self = 2025-05-07T20:31:59.0064139Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:59.0065550Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7facc3962e60>} 2025-05-07T20:31:59.0066930Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:59.0067976Z context = 2025-05-07T20:31:59.0068277Z 2025-05-07T20:31:59.0068453Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:59.0068990Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.0069521Z module_map=module_map) 2025-05-07T20:31:59.0069897Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.0070261Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.0070530Z E ^ 2025-05-07T20:31:59.0071002Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.0071467Z 2025-05-07T20:31:59.0071891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:59.0072412Z 2025-05-07T20:31:59.0072669Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:59.0073101Z self=, 2025-05-07T20:31:59.0073512Z T=2048, 2025-05-07T20:31:59.0073709Z D=5120, 2025-05-07T20:31:59.0073911Z scale_ub=1200.0, 2025-05-07T20:31:59.0074135Z contiguous=True, 2025-05-07T20:31:59.0074362Z compiled=True, 2025-05-07T20:31:59.0074575Z ) 2025-05-07T20:31:59.0074901Z self = 2025-05-07T20:31:59.0075405Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:59.0075680Z 2025-05-07T20:31:59.0075765Z @given( 2025-05-07T20:31:59.0075998Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:59.0076321Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:59.0076638Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:59.0076982Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:59.0077322Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:59.0077616Z ) 2025-05-07T20:31:59.0077977Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:59.0078429Z def test_silu_mul_quant( 2025-05-07T20:31:59.0078678Z self, 2025-05-07T20:31:59.0078882Z T: int, 2025-05-07T20:31:59.0079080Z D: int, 2025-05-07T20:31:59.0079306Z scale_ub: Optional[float], 2025-05-07T20:31:59.0079585Z contiguous: bool, 2025-05-07T20:31:59.0079831Z compiled: bool, 2025-05-07T20:31:59.0080063Z ) -> None: 2025-05-07T20:31:59.0080285Z torch.manual_seed(2025) 2025-05-07T20:31:59.0080529Z 2025-05-07T20:31:59.0080810Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:59.0081161Z 2025-05-07T20:31:59.0081356Z x_sign = torch.sign(x) 2025-05-07T20:31:59.0081659Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:59.0081981Z x = x_sign * x_clamp 2025-05-07T20:31:59.0082227Z x0 = x[:, :D] 2025-05-07T20:31:59.0082444Z x1 = x[:, D:] 2025-05-07T20:31:59.0082654Z 2025-05-07T20:31:59.0082934Z if contiguous: 2025-05-07T20:31:59.0083166Z x0 = x0.contiguous() 2025-05-07T20:31:59.0083430Z x1 = x1.contiguous() 2025-05-07T20:31:59.0083678Z 2025-05-07T20:31:59.0083870Z if scale_ub is not None: 2025-05-07T20:31:59.0084151Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:59.0084501Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:59.0084814Z ) 2025-05-07T20:31:59.0085012Z else: 2025-05-07T20:31:59.0085227Z scale_ub_tensor = None 2025-05-07T20:31:59.0085479Z 2025-05-07T20:31:59.0085718Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.0086037Z op = silu_mul_quant 2025-05-07T20:31:59.0086287Z if compiled: 2025-05-07T20:31:59.0086545Z op = torch.compile(op) 2025-05-07T20:31:59.0086849Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.0087127Z 2025-05-07T20:31:59.0087345Z y_fp8, y_scale = fn() 2025-05-07T20:31:59.0087638Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:59.0087935Z 2025-05-07T20:31:59.0088186Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.0088527Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:59.0088828Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:59.0089173Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:59.0089574Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:59.0089888Z 2025-05-07T20:31:59.0090100Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:59.0090302Z 2025-05-07T20:31:59.0090411Z moe/activation_test.py:126: 2025-05-07T20:31:59.0090711Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.0091186Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:59.0091530Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:59.0092336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:59.0093113Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:59.0093672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:59.0094370Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:59.0095070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:59.0095811Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:59.0096587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:59.0097356Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:59.0098182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:59.0098836Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:59.0099502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:59.0100030Z fn() 2025-05-07T20:31:59.0100543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:59.0101134Z self.fn.run( 2025-05-07T20:31:59.0101616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:59.0102152Z kernel = self.compile( 2025-05-07T20:31:59.0102715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:59.0103385Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.0103875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.0104103Z 2025-05-07T20:31:59.0104317Z self = 2025-05-07T20:31:59.0105423Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:59.0106829Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7facc243d6c0>} 2025-05-07T20:31:59.0108207Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:59.0109247Z context = 2025-05-07T20:31:59.0109555Z 2025-05-07T20:31:59.0109724Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:59.0110261Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.0110743Z module_map=module_map) 2025-05-07T20:31:59.0111110Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.0111476Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:59.0111757Z E ^ 2025-05-07T20:31:59.0112229Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.0112689Z 2025-05-07T20:31:59.0113193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:59.0113724Z 2025-05-07T20:31:59.0113833Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:59.0114263Z self=, 2025-05-07T20:31:59.0114667Z T=16384, 2025-05-07T20:31:59.0114867Z D=7168, 2025-05-07T20:31:59.0115069Z scale_ub=1200.0, 2025-05-07T20:31:59.0115295Z contiguous=False, 2025-05-07T20:31:59.0115529Z compiled=False, 2025-05-07T20:31:59.0115744Z ) 2025-05-07T20:31:59.5706574Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:59.5708778Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:31:59.5710542Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:59.5712032Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:59.5713460Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:59.5714903Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.5716267Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:59.5717698Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.5719345Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:59.5720633Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 
2025-05-07T20:31:59.5721903Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:59.5723165Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:31:59.5724249Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:59.5725315Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:31:59.5726580Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:59.5727913Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:59.5729198Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:59.5730348Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:31:59.5731566Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:59.5732957Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:59.5734046Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.5734995Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.5735758Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^ 2025-05-07T20:31:59.5736808Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:59.7277276Z W0507 20:31:59.724000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:00.7806872Z self = 
2025-05-07T20:32:00.7807421Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:00.7807720Z 
2025-05-07T20:32:00.7807806Z @given(
2025-05-07T20:32:00.7808089Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:00.7808425Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:00.7808736Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:00.7809082Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:00.7809509Z compiled=st.sampled_from([True, False]),
2025-05-07T20:32:00.7810112Z )
2025-05-07T20:32:00.7810829Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:00.7811746Z def test_silu_mul_quant(
2025-05-07T20:32:00.7812227Z self,
2025-05-07T20:32:00.7812618Z T: int,
2025-05-07T20:32:00.7813021Z D: int,
2025-05-07T20:32:00.7813455Z scale_ub: Optional[float],
2025-05-07T20:32:00.7814012Z contiguous: bool,
2025-05-07T20:32:00.7814493Z compiled: bool,
2025-05-07T20:32:00.7814948Z ) -> None:
2025-05-07T20:32:00.7815375Z torch.manual_seed(2025)
2025-05-07T20:32:00.7815867Z 
2025-05-07T20:32:00.7816427Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:00.7817113Z 
2025-05-07T20:32:00.7817507Z x_sign = torch.sign(x)
2025-05-07T20:32:00.7818186Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:00.7819218Z x = x_sign * x_clamp
2025-05-07T20:32:00.7819607Z x0 = x[:, :D]
2025-05-07T20:32:00.7819868Z x1 = x[:, D:]
2025-05-07T20:32:00.7820075Z 
2025-05-07T20:32:00.7820273Z if contiguous:
2025-05-07T20:32:00.7820512Z x0 = x0.contiguous()
2025-05-07T20:32:00.7820778Z x1 = x1.contiguous()
2025-05-07T20:32:00.7821026Z 
2025-05-07T20:32:00.7821226Z if scale_ub is not None:
2025-05-07T20:32:00.7821502Z scale_ub_tensor = torch.tensor(
2025-05-07T20:32:00.7821848Z [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:00.7822163Z )
2025-05-07T20:32:00.7822362Z else:
2025-05-07T20:32:00.7822573Z scale_ub_tensor = None
2025-05-07T20:32:00.7822834Z 2025-05-07T20:32:00.7823074Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.7823391Z op = silu_mul_quant 2025-05-07T20:32:00.7823648Z if compiled: 2025-05-07T20:32:00.7823906Z op = torch.compile(op) 2025-05-07T20:32:00.7824210Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.7824491Z 2025-05-07T20:32:00.7824703Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.7824871Z 2025-05-07T20:32:00.7824974Z moe/activation_test.py:117: 2025-05-07T20:32:00.7825277Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.7825614Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.7825900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.7826620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.7827332Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.7827891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.7828589Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.7829283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.7829835Z kernel = self.compile( 2025-05-07T20:32:00.7830530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.7831200Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.7831609Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.7831837Z 2025-05-07T20:32:00.7832061Z self = 2025-05-07T20:32:00.7833181Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.7834598Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7facc243d510>} 2025-05-07T20:32:00.7835989Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.7837049Z context = 2025-05-07T20:32:00.7837344Z 2025-05-07T20:32:00.7837519Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.7838048Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.7838533Z module_map=module_map) 2025-05-07T20:32:00.7838910Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.7839274Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.7839556Z E ^ 2025-05-07T20:32:00.7840143Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.7840607Z 2025-05-07T20:32:00.7841047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.7841576Z 2025-05-07T20:32:00.7841682Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.7842119Z self=, 2025-05-07T20:32:00.7842535Z T=1, 2025-05-07T20:32:00.7850272Z D=7168, 2025-05-07T20:32:00.7850490Z scale_ub=None, 2025-05-07T20:32:00.7850718Z contiguous=True, 2025-05-07T20:32:00.7850947Z compiled=True, 2025-05-07T20:32:00.7851169Z ) 2025-05-07T20:32:00.7851510Z self = 2025-05-07T20:32:00.7852004Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:00.7852272Z 2025-05-07T20:32:00.7852357Z @given( 2025-05-07T20:32:00.7852598Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.7852920Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.7853234Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.7853570Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.7853904Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.7854191Z ) 2025-05-07T20:32:00.7854553Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.7855012Z def test_silu_mul_quant( 2025-05-07T20:32:00.7855254Z self, 2025-05-07T20:32:00.7855454Z T: int, 2025-05-07T20:32:00.7856011Z D: int, 2025-05-07T20:32:00.7856229Z scale_ub: Optional[float], 2025-05-07T20:32:00.7856508Z contiguous: bool, 2025-05-07T20:32:00.7856758Z compiled: bool, 2025-05-07T20:32:00.7856980Z ) -> None: 2025-05-07T20:32:00.7857206Z torch.manual_seed(2025) 2025-05-07T20:32:00.7857464Z 2025-05-07T20:32:00.7857750Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.7858149Z 2025-05-07T20:32:00.7858532Z x_sign = torch.sign(x) 2025-05-07T20:32:00.7858834Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.7859144Z x = x_sign * x_clamp 2025-05-07T20:32:00.7859386Z x0 = x[:, :D] 2025-05-07T20:32:00.7859610Z x1 = x[:, D:] 2025-05-07T20:32:00.7859818Z 2025-05-07T20:32:00.7860010Z if contiguous: 2025-05-07T20:32:00.7860253Z x0 = x0.contiguous() 2025-05-07T20:32:00.7860510Z x1 = x1.contiguous() 2025-05-07T20:32:00.7860754Z 2025-05-07T20:32:00.7860960Z if scale_ub is not None: 2025-05-07T20:32:00.7861234Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.7861576Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.7861889Z ) 2025-05-07T20:32:00.7862086Z else: 2025-05-07T20:32:00.7862304Z scale_ub_tensor = None 2025-05-07T20:32:00.7862559Z 2025-05-07T20:32:00.7862793Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.7863118Z op = silu_mul_quant 2025-05-07T20:32:00.7863372Z if compiled: 2025-05-07T20:32:00.7863624Z op = torch.compile(op) 2025-05-07T20:32:00.7863941Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.7864218Z 2025-05-07T20:32:00.7864420Z y_fp8, y_scale = fn() 2025-05-07T20:32:00.7864703Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:00.7864995Z 2025-05-07T20:32:00.7865242Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.7865583Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:00.7865874Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:00.7866190Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:00.7866672Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:00.7866988Z 2025-05-07T20:32:00.7867202Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:00.7867408Z 2025-05-07T20:32:00.7867510Z moe/activation_test.py:126: 2025-05-07T20:32:00.7867808Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.7868145Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:00.7868468Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:00.7869271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:00.7870088Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:00.7870646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.7871333Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.7872041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:00.7872774Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:00.7873541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:00.7874299Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:00.7875036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:00.7875684Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:00.7876291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:00.7876810Z fn() 2025-05-07T20:32:00.7877332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:00.7877921Z self.fn.run( 2025-05-07T20:32:00.7878388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.7879008Z kernel = self.compile( 2025-05-07T20:32:00.7879560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.7880277Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.7880670Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.7880906Z 2025-05-07T20:32:00.7881117Z self = 2025-05-07T20:32:00.7882224Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.7883627Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7facc243d7e0>} 2025-05-07T20:32:00.7884999Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.7886047Z context = 2025-05-07T20:32:00.7886346Z 2025-05-07T20:32:00.7886515Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.7887051Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.7887519Z module_map=module_map) 2025-05-07T20:32:00.7887901Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.7888339Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:00.7888605Z E ^ 2025-05-07T20:32:00.7889074Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.7889540Z 2025-05-07T20:32:00.7889960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.7890478Z 2025-05-07T20:32:00.7890590Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.7891008Z self=, 2025-05-07T20:32:00.7891414Z T=4096, 2025-05-07T20:32:00.7891607Z D=5120, 2025-05-07T20:32:00.7891805Z scale_ub=None, 2025-05-07T20:32:00.7892024Z contiguous=False, 2025-05-07T20:32:00.7892257Z compiled=False, 2025-05-07T20:32:00.7892466Z ) 2025-05-07T20:32:01.3800835Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:01.3803009Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): 2025-05-07T20:32:01.3805746Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:01.3808645Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:01.3810590Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:01.3812005Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3813501Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:01.3814910Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3816358Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:01.3817633Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] generator.visit(fn.parse()) 
2025-05-07T20:32:01.3818946Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:01.3820239Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ret = super().visit(node) 2025-05-07T20:32:01.3821294Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:01.3822335Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return visitor(node) 2025-05-07T20:32:01.3823688Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:01.3824998Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:01.3826137Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:01.3827203Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] self.visit(item) 2025-05-07T20:32:01.3828398Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:01.3829784Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:01.3830866Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3831804Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3832559Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^ 2025-05-07T20:32:01.3833594Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:01.9856690Z W0507 20:32:01.982000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:03.1577124Z self = 
2025-05-07T20:32:03.1577672Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:03.1578429Z 
2025-05-07T20:32:03.1578635Z @given(
2025-05-07T20:32:03.1578885Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:03.1579227Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:03.1579603Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:03.1579957Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:03.1580356Z compiled=st.sampled_from([True, False]),
2025-05-07T20:32:03.1580654Z )
2025-05-07T20:32:03.1581027Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:03.1581498Z def test_silu_mul_quant(
2025-05-07T20:32:03.1581749Z self,
2025-05-07T20:32:03.1581957Z T: int,
2025-05-07T20:32:03.1582169Z D: int,
2025-05-07T20:32:03.1582396Z scale_ub: Optional[float],
2025-05-07T20:32:03.1582686Z contiguous: bool,
2025-05-07T20:32:03.1582941Z compiled: bool,
2025-05-07T20:32:03.1583176Z ) -> None:
2025-05-07T20:32:03.1583408Z torch.manual_seed(2025)
2025-05-07T20:32:03.1583851Z 
2025-05-07T20:32:03.1584137Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:03.1584492Z 
2025-05-07T20:32:03.1584709Z x_sign = torch.sign(x)
2025-05-07T20:32:03.1585014Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:03.1585329Z x = x_sign * x_clamp
2025-05-07T20:32:03.1585581Z x0 = x[:, :D]
2025-05-07T20:32:03.1585808Z x1 = x[:, D:]
2025-05-07T20:32:03.1586018Z 
2025-05-07T20:32:03.1586216Z if contiguous:
2025-05-07T20:32:03.1586460Z x0 = x0.contiguous()
2025-05-07T20:32:03.1586722Z x1 = x1.contiguous()
2025-05-07T20:32:03.1586970Z 
2025-05-07T20:32:03.1587174Z if scale_ub is not None:
2025-05-07T20:32:03.1587453Z scale_ub_tensor = torch.tensor(
2025-05-07T20:32:03.1587807Z [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:03.1588126Z )
2025-05-07T20:32:03.1588328Z else:
2025-05-07T20:32:03.1588549Z scale_ub_tensor = None
2025-05-07T20:32:03.1588811Z 2025-05-07T20:32:03.1589055Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.1589379Z op = silu_mul_quant 2025-05-07T20:32:03.1589638Z if compiled: 2025-05-07T20:32:03.1589893Z op = torch.compile(op) 2025-05-07T20:32:03.1590198Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.1590525Z 2025-05-07T20:32:03.1590732Z > y_fp8, y_scale = fn() 2025-05-07T20:32:03.1590902Z 2025-05-07T20:32:03.1591007Z moe/activation_test.py:117: 2025-05-07T20:32:03.1591311Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.1591648Z moe/activation_test.py:115: in fn 2025-05-07T20:32:03.1591944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.1592668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:03.1593389Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:03.1593949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:03.1594778Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:03.1595465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:03.1596016Z kernel = self.compile( 2025-05-07T20:32:03.1596580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:03.1597253Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.1597666Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.1597896Z 2025-05-07T20:32:03.1598119Z self = 2025-05-07T20:32:03.1599242Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:03.1600664Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7facc23c5090>} 2025-05-07T20:32:03.1602048Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:03.1603107Z context = 2025-05-07T20:32:03.1603408Z 2025-05-07T20:32:03.1603588Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:03.1604204Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.1604696Z module_map=module_map) 2025-05-07T20:32:03.1605073Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.1605445Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.1605709Z E ^ 2025-05-07T20:32:03.1606190Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.1606653Z 2025-05-07T20:32:03.1607088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:03.1607615Z 2025-05-07T20:32:03.1607724Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:03.1608156Z self=, 2025-05-07T20:32:03.1608568Z T=4096, 2025-05-07T20:32:03.1608770Z D=7168, 2025-05-07T20:32:03.1608964Z scale_ub=None, 2025-05-07T20:32:03.1609189Z contiguous=False, 2025-05-07T20:32:03.1609435Z compiled=False, 2025-05-07T20:32:03.1609648Z ) 2025-05-07T20:32:03.1609992Z self = 2025-05-07T20:32:03.1610549Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:03.1610827Z 2025-05-07T20:32:03.1610907Z @given( 2025-05-07T20:32:03.1611149Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:03.1611472Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:03.1611782Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:03.1612125Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:03.1612465Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:03.1612756Z ) 2025-05-07T20:32:03.1613108Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:03.1613561Z def test_silu_mul_quant( 2025-05-07T20:32:03.1613814Z self, 2025-05-07T20:32:03.1614015Z T: int, 2025-05-07T20:32:03.1614221Z D: int, 2025-05-07T20:32:03.1614447Z scale_ub: Optional[float], 2025-05-07T20:32:03.1614723Z contiguous: bool, 2025-05-07T20:32:03.1615058Z compiled: bool, 2025-05-07T20:32:03.1615289Z ) -> None: 2025-05-07T20:32:03.1615508Z torch.manual_seed(2025) 2025-05-07T20:32:03.1615759Z 2025-05-07T20:32:03.1616042Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:03.1616389Z 2025-05-07T20:32:03.1616592Z x_sign = torch.sign(x) 2025-05-07T20:32:03.1616893Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:03.1617211Z x = x_sign * x_clamp 2025-05-07T20:32:03.1617454Z x0 = x[:, :D] 2025-05-07T20:32:03.1617676Z x1 = x[:, D:] 2025-05-07T20:32:03.1618104Z 2025-05-07T20:32:03.1618296Z if contiguous: 2025-05-07T20:32:03.1618533Z x0 = x0.contiguous() 2025-05-07T20:32:03.1618800Z x1 = x1.contiguous() 2025-05-07T20:32:03.1619049Z 2025-05-07T20:32:03.1619250Z if scale_ub is not None: 2025-05-07T20:32:03.1619530Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:03.1619880Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:03.1620202Z ) 2025-05-07T20:32:03.1620437Z else: 2025-05-07T20:32:03.1620660Z scale_ub_tensor = None 2025-05-07T20:32:03.1620925Z 2025-05-07T20:32:03.1621165Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.1621481Z op = silu_mul_quant 2025-05-07T20:32:03.1621741Z if compiled: 2025-05-07T20:32:03.1621997Z op = torch.compile(op) 2025-05-07T20:32:03.1622299Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.1622583Z 2025-05-07T20:32:03.1622794Z > y_fp8, y_scale = fn() 2025-05-07T20:32:03.1622963Z 2025-05-07T20:32:03.1623071Z moe/activation_test.py:117: 2025-05-07T20:32:03.1623485Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.1623825Z moe/activation_test.py:115: in fn 2025-05-07T20:32:03.1624118Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.1624856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:03.1625564Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:03.1626119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:03.1626823Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:03.1627503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:03.1628055Z kernel = self.compile( 2025-05-07T20:32:03.1628614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:03.1629300Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.1629708Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.1629950Z 2025-05-07T20:32:03.1630166Z self = 2025-05-07T20:32:03.1631281Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:03.1632696Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7facc23c6560>} 2025-05-07T20:32:03.1634080Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:03.1635136Z context = 2025-05-07T20:32:03.1635436Z 2025-05-07T20:32:03.1635691Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:03.1636229Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.1636708Z module_map=module_map) 2025-05-07T20:32:03.1637083Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.1637446Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.1637715Z E ^ 2025-05-07T20:32:03.1638191Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.1638656Z 2025-05-07T20:32:03.1639087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:03.1639614Z 2025-05-07T20:32:03.1639730Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:03.1640195Z self=, 2025-05-07T20:32:03.1640621Z T=128, 2025-05-07T20:32:03.1640814Z D=7168, 2025-05-07T20:32:03.1641011Z scale_ub=None, 2025-05-07T20:32:03.1641229Z contiguous=False, 2025-05-07T20:32:03.1641461Z compiled=True, 2025-05-07T20:32:03.1641672Z ) 2025-05-07T20:32:03.2271606Z self = 2025-05-07T20:32:03.2272138Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:03.2272449Z 2025-05-07T20:32:03.2272535Z @given( 2025-05-07T20:32:03.2272803Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:03.2273121Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:03.2273440Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:03.2273781Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:03.2274294Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:03.2274596Z ) 2025-05-07T20:32:03.2274960Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:03.2275426Z def test_silu_mul_quant( 2025-05-07T20:32:03.2275671Z self, 2025-05-07T20:32:03.2275873Z T: int, 2025-05-07T20:32:03.2276078Z D: int, 2025-05-07T20:32:03.2276299Z scale_ub: Optional[float], 2025-05-07T20:32:03.2276581Z contiguous: bool, 2025-05-07T20:32:03.2276833Z compiled: bool, 2025-05-07T20:32:03.2277057Z ) -> None: 2025-05-07T20:32:03.2277281Z torch.manual_seed(2025) 2025-05-07T20:32:03.2277530Z 2025-05-07T20:32:03.2277807Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:03.2278159Z 2025-05-07T20:32:03.2278360Z x_sign = torch.sign(x) 2025-05-07T20:32:03.2278651Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:03.2278976Z x = x_sign * x_clamp 2025-05-07T20:32:03.2279221Z x0 = x[:, :D] 2025-05-07T20:32:03.2279442Z x1 = x[:, D:] 2025-05-07T20:32:03.2279653Z 2025-05-07T20:32:03.2279848Z if contiguous: 2025-05-07T20:32:03.2280083Z x0 = x0.contiguous() 2025-05-07T20:32:03.2280377Z x1 = x1.contiguous() 2025-05-07T20:32:03.2280648Z 2025-05-07T20:32:03.2280846Z if scale_ub is not None: 2025-05-07T20:32:03.2281122Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:03.2281467Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:03.2281781Z ) 2025-05-07T20:32:03.2281972Z else: 2025-05-07T20:32:03.2282188Z scale_ub_tensor = None 2025-05-07T20:32:03.2282453Z 2025-05-07T20:32:03.2282688Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.2283008Z op = silu_mul_quant 2025-05-07T20:32:03.2283267Z if compiled: 2025-05-07T20:32:03.2283522Z op = torch.compile(op) 2025-05-07T20:32:03.2283829Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.2284107Z 2025-05-07T20:32:03.2284429Z y_fp8, y_scale = fn() 2025-05-07T20:32:03.2284725Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:03.2285025Z 2025-05-07T20:32:03.2285266Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.2285618Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:03.2292780Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:03.2293171Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:03.2293543Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:03.2293851Z 2025-05-07T20:32:03.2294060Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:03.2294262Z 2025-05-07T20:32:03.2294371Z moe/activation_test.py:126: 2025-05-07T20:32:03.2294675Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.2295015Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:03.2295347Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:03.2296159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:03.2296919Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:03.2297475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:03.2298269Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:03.2298969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:03.2299709Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:03.2300641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:03.2301407Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:03.2302148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:03.2302801Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:03.2303411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:03.2303937Z fn() 2025-05-07T20:32:03.2304449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:03.2305044Z self.fn.run( 2025-05-07T20:32:03.2305523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:03.2306057Z kernel = self.compile( 2025-05-07T20:32:03.2306621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:03.2307292Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.2307701Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.2307929Z 2025-05-07T20:32:03.2308144Z self = 2025-05-07T20:32:03.2309251Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:03.2310711Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7facc23c7d00>} 2025-05-07T20:32:03.2312085Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:03.2313216Z context = 2025-05-07T20:32:03.2313516Z 2025-05-07T20:32:03.2313687Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:03.2314225Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.2314703Z module_map=module_map) 2025-05-07T20:32:03.2315071Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.2315439Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:03.2315716Z E ^ 2025-05-07T20:32:03.2316185Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.2316647Z 2025-05-07T20:32:03.2317075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:03.2317601Z 2025-05-07T20:32:03.2317709Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:03.2318143Z self=, 2025-05-07T20:32:03.2318542Z T=128, 2025-05-07T20:32:03.2318742Z D=7168, 2025-05-07T20:32:03.2318939Z scale_ub=None, 2025-05-07T20:32:03.2319157Z contiguous=False, 2025-05-07T20:32:03.2319388Z compiled=False, 2025-05-07T20:32:03.2319617Z ) 2025-05-07T20:32:03.5919603Z self = 2025-05-07T20:32:03.5920147Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:03.5920491Z 2025-05-07T20:32:03.5920582Z @given( 2025-05-07T20:32:03.5920818Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:03.5921163Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:03.5921663Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:03.5922010Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:03.5922349Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:03.5922648Z ) 2025-05-07T20:32:03.5923004Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:03.5923461Z def test_silu_mul_quant( 2025-05-07T20:32:03.5923712Z self, 2025-05-07T20:32:03.5923912Z T: int, 2025-05-07T20:32:03.5924120Z D: int, 2025-05-07T20:32:03.5924347Z scale_ub: Optional[float], 2025-05-07T20:32:03.5924632Z contiguous: bool, 2025-05-07T20:32:03.5924875Z compiled: bool, 2025-05-07T20:32:03.5925112Z ) -> None: 2025-05-07T20:32:03.5925339Z torch.manual_seed(2025) 2025-05-07T20:32:03.5925585Z 2025-05-07T20:32:03.5925871Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:03.5926224Z 2025-05-07T20:32:03.5926429Z x_sign = torch.sign(x) 2025-05-07T20:32:03.5926734Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:03.5927054Z x = x_sign * x_clamp 2025-05-07T20:32:03.5927303Z x0 = x[:, :D] 2025-05-07T20:32:03.5927530Z x1 = x[:, D:] 2025-05-07T20:32:03.5927746Z 2025-05-07T20:32:03.5927940Z if contiguous: 2025-05-07T20:32:03.5928186Z x0 = x0.contiguous() 2025-05-07T20:32:03.5928453Z x1 = x1.contiguous() 2025-05-07T20:32:03.5928698Z 2025-05-07T20:32:03.5928900Z if scale_ub is not None: 2025-05-07T20:32:03.5929181Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:03.5929528Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:03.5929848Z ) 2025-05-07T20:32:03.5930049Z else: 2025-05-07T20:32:03.5930270Z scale_ub_tensor = None 2025-05-07T20:32:03.5930552Z 2025-05-07T20:32:03.5930815Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.5931144Z op = silu_mul_quant 2025-05-07T20:32:03.5931395Z if compiled: 
2025-05-07T20:32:03.5931648Z op = torch.compile(op) 2025-05-07T20:32:03.5932140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.5932412Z 2025-05-07T20:32:03.5932609Z > y_fp8, y_scale = fn() 2025-05-07T20:32:03.5932775Z 2025-05-07T20:32:03.5932885Z moe/activation_test.py:117: 2025-05-07T20:32:03.5933180Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.5933513Z moe/activation_test.py:115: in fn 2025-05-07T20:32:03.5933803Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.5934511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:03.5935216Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:03.5935766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:03.5936471Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:03.5937142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:03.5937691Z kernel = self.compile( 2025-05-07T20:32:03.5938378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:03.5939053Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.5939451Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.5939686Z 2025-05-07T20:32:03.5939901Z self = 2025-05-07T20:32:03.5941095Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:03.5942506Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7facc23c72e0>} 2025-05-07T20:32:03.5943880Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:03.5944927Z context = 2025-05-07T20:32:03.5945255Z 2025-05-07T20:32:03.5945426Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:03.5945962Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.5946445Z module_map=module_map) 2025-05-07T20:32:03.5946813Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.5947185Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.5947448Z E ^ 2025-05-07T20:32:03.5947924Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.5948388Z 2025-05-07T20:32:03.5948812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:03.5949340Z 2025-05-07T20:32:03.5949448Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:03.5949870Z self=, 2025-05-07T20:32:03.5950315Z T=4096, 2025-05-07T20:32:03.5950520Z D=5120, 2025-05-07T20:32:03.5950718Z scale_ub=1200.0, 2025-05-07T20:32:03.5950949Z contiguous=True, 2025-05-07T20:32:03.5951174Z compiled=False, 2025-05-07T20:32:03.5951386Z ) 2025-05-07T20:32:03.5951718Z self = 2025-05-07T20:32:03.5952219Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:03.5952499Z 2025-05-07T20:32:03.5952577Z @given( 2025-05-07T20:32:03.5952895Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:03.5953207Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:03.5953518Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:03.5953852Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:03.5954185Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:03.5954471Z ) 2025-05-07T20:32:03.5954825Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:03.5955275Z def test_silu_mul_quant( 2025-05-07T20:32:03.5955514Z self, 2025-05-07T20:32:03.5955982Z T: int, 2025-05-07T20:32:03.5956186Z D: int, 2025-05-07T20:32:03.5956401Z scale_ub: Optional[float], 2025-05-07T20:32:03.5956679Z contiguous: bool, 2025-05-07T20:32:03.5956931Z compiled: bool, 2025-05-07T20:32:03.5957153Z ) -> None: 2025-05-07T20:32:03.5957375Z torch.manual_seed(2025) 2025-05-07T20:32:03.5957623Z 2025-05-07T20:32:03.5957904Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:03.5958252Z 2025-05-07T20:32:03.5958449Z x_sign = torch.sign(x) 2025-05-07T20:32:03.5958743Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:03.5959057Z x = x_sign * x_clamp 2025-05-07T20:32:03.5959300Z x0 = x[:, :D] 2025-05-07T20:32:03.5959520Z x1 = x[:, D:] 2025-05-07T20:32:03.5959725Z 2025-05-07T20:32:03.5959916Z if contiguous: 2025-05-07T20:32:03.5960152Z x0 = x0.contiguous() 2025-05-07T20:32:03.5960439Z x1 = x1.contiguous() 2025-05-07T20:32:03.5960706Z 2025-05-07T20:32:03.5960901Z if scale_ub is not None: 2025-05-07T20:32:03.5961174Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:03.5961657Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:03.5961972Z ) 2025-05-07T20:32:03.5962164Z else: 2025-05-07T20:32:03.5962379Z scale_ub_tensor = None 2025-05-07T20:32:03.5962641Z 2025-05-07T20:32:03.5962875Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.5963192Z op = silu_mul_quant 2025-05-07T20:32:03.5963443Z if compiled: 2025-05-07T20:32:03.5963688Z op = torch.compile(op) 2025-05-07T20:32:03.5963991Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.5964268Z 2025-05-07T20:32:03.5964459Z > y_fp8, y_scale = fn() 2025-05-07T20:32:03.5964631Z 2025-05-07T20:32:03.5964731Z moe/activation_test.py:117: 2025-05-07T20:32:03.5965029Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.5965358Z moe/activation_test.py:115: in fn 2025-05-07T20:32:03.5965641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.5966352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:03.5967068Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:03.5967615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:03.5968303Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:03.5968982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:03.5969523Z kernel = self.compile( 2025-05-07T20:32:03.5970075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:03.5970741Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.5971140Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.5971372Z 2025-05-07T20:32:03.5971590Z self = 2025-05-07T20:32:03.5972691Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:03.5974239Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9df6cdc0>} 2025-05-07T20:32:03.5975612Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:03.5976659Z context = 2025-05-07T20:32:03.5976950Z 2025-05-07T20:32:03.5977131Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:03.5977659Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.5978328Z module_map=module_map) 2025-05-07T20:32:03.5978703Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.5979067Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.5979327Z E ^ 2025-05-07T20:32:03.5979797Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.5980266Z 2025-05-07T20:32:03.5980734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:03.5981252Z 2025-05-07T20:32:03.5981363Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:03.5981779Z self=, 2025-05-07T20:32:03.5982183Z T=1, 2025-05-07T20:32:03.5982456Z D=5120, 2025-05-07T20:32:03.5982648Z scale_ub=None, 2025-05-07T20:32:03.5982867Z contiguous=True, 2025-05-07T20:32:03.5983096Z compiled=True, 2025-05-07T20:32:03.5983305Z ) 2025-05-07T20:32:04.0553216Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:04.0554321Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:04.0555910Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:04.0557486Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:04.0558904Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:04.0560321Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.0561652Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:04.0563056Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.0564500Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:04.0565961Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:04.0567210Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:04.0568448Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:04.0569508Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:04.0570595Z W0507 20:32:04.052000 87377 
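The CompilationError above is an architecture limit rather than a bug in the kernel under test: Triton's fp8e4nv is the e4m3 encoding behind torch.float8_e4m3fn, and this Triton build only lowers it on NVIDIA GPUs with compute capability (8, 9) or newer (Ada/Hopper). On older parts only fp8e4b15 and fp8e5 are available, exactly as the ValueError reports. A minimal sketch of a capability gate that would skip these tests on such GPUs (the helper name and the skip wiring are assumptions, not FBGEMM code):

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton can lower fp8e4nv (e4m3) only on sm_89+ GPUs;
        # older architectures are limited to fp8e4b15 / fp8e5.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on a unittest-style test like the one in this log:
    # @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires sm_89+")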
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:04.0571846Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:04.0573152Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:04.0574289Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:04.0575346Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:04.0576665Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:04.0578147Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:04.0579229Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.0580157Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.0580957Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:04.0582000Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.2175274Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:04.2176783Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:04.2178252Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:04.2179778Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:04.2181197Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:04.2182812Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.2184141Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:04.2185543Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.2186990Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:04.2188262Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:04.2189502Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:04.2190784Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:04.2191835Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:04.2193011Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:04.2194250Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:04.2195558Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:04.2196696Z W0507 
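The repeated W0507 blocks are torch.compile bookkeeping, not additional failures: when Dynamo traces a user-defined Triton kernel, triton_kernel_wrap.py generates TTIR for it to work out which tensor arguments the kernel mutates. When that generation fails (here with the same fp8e4nv ValueError), it falls back to "assuming every input is mutated", which is safe but pessimistic, so each attempt is logged as a warning before the real launch fails. A hedged sketch of that fallback pattern (analyze_ttir is a stand-in for the torch internals, not a real API):

    import torch

    def analyze_ttir(kernel, args):
        # Stand-in for torch's generate_ttir-based mutation analysis;
        # on this GPU it fails just like the log shows.
        raise ValueError("type fp8e4nv not supported in this architecture")

    def mutated_tensor_args(kernel, args):
        try:
            return analyze_ttir(kernel, args)
        except Exception:
            # "Assuming every input is mutated": correct in all cases, but
            # it blocks optimizations that need to tell reads from writes.
            return [a for a in args if isinstance(a, torch.Tensor)]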
20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:04.2197760Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:04.2198969Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:04.2200344Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:04.2201472Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.2202400Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.2203153Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:04.2204190Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.6611677Z self = 2025-05-07T20:32:04.6612490Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:04.6612858Z 2025-05-07T20:32:04.6612944Z @given( 2025-05-07T20:32:04.6613198Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.6613527Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.6613845Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.6614191Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.6614532Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.6614825Z ) 2025-05-07T20:32:04.6615186Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.6615646Z def test_silu_mul_quant( 2025-05-07T20:32:04.6615904Z self, 2025-05-07T20:32:04.6616111Z T: int, 2025-05-07T20:32:04.6616321Z D: int, 2025-05-07T20:32:04.6616554Z scale_ub: Optional[float], 2025-05-07T20:32:04.6616834Z contiguous: bool, 2025-05-07T20:32:04.6617090Z compiled: bool, 2025-05-07T20:32:04.6617330Z ) -> None: 2025-05-07T20:32:04.6617552Z torch.manual_seed(2025) 2025-05-07T20:32:04.6617805Z 2025-05-07T20:32:04.6618161Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.6618515Z 2025-05-07T20:32:04.6618723Z x_sign = torch.sign(x) 2025-05-07T20:32:04.6619028Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.6619344Z x = x_sign * x_clamp 2025-05-07T20:32:04.6619593Z x0 = x[:, :D] 2025-05-07T20:32:04.6619820Z x1 = x[:, D:] 2025-05-07T20:32:04.6620039Z 2025-05-07T20:32:04.6620231Z if contiguous: 2025-05-07T20:32:04.6620474Z x0 = x0.contiguous() 2025-05-07T20:32:04.6620923Z x1 = x1.contiguous() 2025-05-07T20:32:04.6621173Z 2025-05-07T20:32:04.6621375Z if scale_ub is not None: 2025-05-07T20:32:04.6621661Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:04.6622014Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:04.6622332Z ) 2025-05-07T20:32:04.6622535Z else: 2025-05-07T20:32:04.6622749Z scale_ub_tensor = None 
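The listing resumes below with fn() and ref_fn(): the op under test fuses SiLU-and-multiply with fp8 quantization, while ref_fn computes y = x0 * sigmoid(x0) * x1 in fp32 and only then quantizes rowwise. The activation pulled out as a standalone sketch (the function name is illustrative, not FBGEMM's API):

    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # SiLU(x0) * x1 in fp32, matching the math in the test's ref_fn.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32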
2025-05-07T20:32:04.6623012Z 2025-05-07T20:32:04.6623257Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.6623578Z op = silu_mul_quant 2025-05-07T20:32:04.6623839Z if compiled: 2025-05-07T20:32:04.6624100Z op = torch.compile(op) 2025-05-07T20:32:04.6624407Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.6624695Z 2025-05-07T20:32:04.6624896Z y_fp8, y_scale = fn() 2025-05-07T20:32:04.6625195Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:04.6625490Z 2025-05-07T20:32:04.6625750Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.6626098Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:04.6626403Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:04.6626734Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:04.6627107Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:04.6627428Z 2025-05-07T20:32:04.6627643Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:04.6627847Z 2025-05-07T20:32:04.6627959Z moe/activation_test.py:126: 2025-05-07T20:32:04.6628265Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.6628609Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:04.6628953Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:04.6629779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:04.6630551Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:04.6631116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:04.6631901Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:04.6632608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:04.6633340Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:04.6634108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:04.6634869Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:04.6635617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:04.6636267Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:04.6636884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:04.6637420Z fn() 2025-05-07T20:32:04.6637936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:04.6638528Z self.fn.run( 2025-05-07T20:32:04.6639008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:04.6639550Z kernel = self.compile( 2025-05-07T20:32:04.6640098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:04.6640820Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.6641226Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.6641454Z 2025-05-07T20:32:04.6641749Z self = 2025-05-07T20:32:04.6642861Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:04.6644274Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7facc2018430>} 2025-05-07T20:32:04.6645647Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:04.6646698Z context = 2025-05-07T20:32:04.6646993Z 2025-05-07T20:32:04.6647169Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:04.6647704Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.6648191Z module_map=module_map) 2025-05-07T20:32:04.6648561Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.6648928Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:04.6649207Z E ^ 2025-05-07T20:32:04.6649687Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.6650145Z 2025-05-07T20:32:04.6650573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:04.6651149Z 2025-05-07T20:32:04.6651258Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.6651684Z self=, 2025-05-07T20:32:04.6652093Z T=2048, 2025-05-07T20:32:04.6652289Z D=5120, 2025-05-07T20:32:04.6652494Z scale_ub=None, 2025-05-07T20:32:04.6652719Z contiguous=True, 2025-05-07T20:32:04.6652947Z compiled=True, 2025-05-07T20:32:04.6653243Z ) 2025-05-07T20:32:05.0816882Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.0818277Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last): 2025-05-07T20:32:05.0819640Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.0821150Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.0822554Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.0823956Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.0825289Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.0826689Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.0828296Z W0507 20:32:05.078000 87377 
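The frames above show the reference path failing the same way: triton_quantize_fp8_row (fp8_gemm.py:2370) launches _kernel_quantize_fp8_row, itself a Triton kernel, so both the fused op and its reference hit the identical fp8e4nv limit. The test dequantizes as y_fp8.to(torch.float32) * y_scale[:, None], i.e. scales are per-row multipliers, so a Triton-free rowwise quantizer could look like this sketch (the real kernel's scaling and clamping details may differ; names and eps are assumptions):

    import torch

    def quantize_fp8_row_ref(y: torch.Tensor, scale_ub=None, eps: float = 1e-12):
        # Scale each row so its max |value| maps onto the e4m3 maximum
        # (~448), optionally clamping the row max to scale_ub first.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1).clamp_min(eps)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale  # dequantize as y_fp8.float() * scale[:, None]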
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.0829570Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] generator.visit(fn.parse()) 2025-05-07T20:32:05.0830832Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.0832081Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:32:05.0833139Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:05.0834175Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 2025-05-07T20:32:05.0835422Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.0836718Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.0837855Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:05.0838910Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:32:05.0840114Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.0841615Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.0842685Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.0843613Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.0844364Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:32:05.0845402Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.2435641Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.2437106Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last): 2025-05-07T20:32:05.2438467Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.2439914Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.2441556Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.2442968Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.2444288Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.2445689Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.2447133Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.2448405Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] generator.visit(fn.parse()) 2025-05-07T20:32:05.2449643Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.2450900Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:32:05.2451973Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:05.2453017Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 2025-05-07T20:32:05.2454256Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.2455871Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.2457002Z W0507 
20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:05.2458126Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:32:05.2459335Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.2460716Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.2461791Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.2462709Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.2463456Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:32:05.2464610Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.6846983Z self = 2025-05-07T20:32:05.6847685Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.6847974Z 2025-05-07T20:32:05.6848060Z @given( 2025-05-07T20:32:05.6848296Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.6848617Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.6848955Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.6849296Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.6849629Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.6849922Z ) 2025-05-07T20:32:05.6850281Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.6850731Z def test_silu_mul_quant( 2025-05-07T20:32:05.6851005Z self, 2025-05-07T20:32:05.6851226Z T: int, 2025-05-07T20:32:05.6851428Z D: int, 2025-05-07T20:32:05.6851644Z scale_ub: Optional[float], 2025-05-07T20:32:05.6851920Z contiguous: bool, 2025-05-07T20:32:05.6852168Z compiled: bool, 2025-05-07T20:32:05.6852392Z ) -> None: 2025-05-07T20:32:05.6852616Z torch.manual_seed(2025) 2025-05-07T20:32:05.6852864Z 2025-05-07T20:32:05.6853139Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.6853485Z 2025-05-07T20:32:05.6853692Z x_sign = torch.sign(x) 2025-05-07T20:32:05.6853985Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.6854298Z x = x_sign * x_clamp 2025-05-07T20:32:05.6854540Z x0 = x[:, :D] 2025-05-07T20:32:05.6854754Z x1 = x[:, D:] 2025-05-07T20:32:05.6854963Z 2025-05-07T20:32:05.6855153Z if contiguous: 2025-05-07T20:32:05.6855388Z x0 = x0.contiguous() 2025-05-07T20:32:05.6855833Z x1 = x1.contiguous() 2025-05-07T20:32:05.6856079Z 2025-05-07T20:32:05.6856278Z if scale_ub is not None: 2025-05-07T20:32:05.6856551Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.6857065Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.6857375Z ) 2025-05-07T20:32:05.6857565Z else: 2025-05-07T20:32:05.6857778Z scale_ub_tensor = None 
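The full test source reprints for every failing draw because the test runs under Hypothesis with verbosity=Verbosity.verbose: each sampled combination of (T, D, scale_ub, contiguous, compiled) is announced as "Trying example", and deadline=None disables the per-example time budget. A self-contained sketch of that setup with a toy body (max_examples uses a literal here because _MAX_SAMPLES's value does not appear in this log):

    from hypothesis import Verbosity, given, settings
    from hypothesis import strategies as st

    @given(t=st.sampled_from([1, 128, 2048, 4096, 16384]))
    @settings(verbosity=Verbosity.verbose, max_examples=5, deadline=None)
    def test_demo(t: int) -> None:
        # Verbose mode prints "Trying example: test_demo(t=...)" per draw.
        assert t >= 1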
2025-05-07T20:32:05.6858085Z 2025-05-07T20:32:05.6858316Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.6858641Z op = silu_mul_quant 2025-05-07T20:32:05.6858890Z if compiled: 2025-05-07T20:32:05.6859134Z op = torch.compile(op) 2025-05-07T20:32:05.6859439Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.6859714Z 2025-05-07T20:32:05.6859907Z y_fp8, y_scale = fn() 2025-05-07T20:32:05.6860200Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:05.6860492Z 2025-05-07T20:32:05.6860738Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.6861122Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:05.6861430Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:05.6861750Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:05.6862105Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.6862422Z 2025-05-07T20:32:05.6862626Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:05.6862823Z 2025-05-07T20:32:05.6862929Z moe/activation_test.py:126: 2025-05-07T20:32:05.6863226Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.6863560Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:05.6863891Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.6864874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:05.6865645Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.6866202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.6866894Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.6867594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:05.6868329Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.6869093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:05.6869849Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.6870588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:05.6871250Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.6871861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:05.6872391Z fn() 2025-05-07T20:32:05.6872914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:05.6873508Z self.fn.run( 2025-05-07T20:32:05.6873980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.6874522Z kernel = self.compile( 2025-05-07T20:32:05.6875077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.6875748Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.6876148Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.6876381Z 2025-05-07T20:32:05.6876599Z self = 2025-05-07T20:32:05.6877708Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.6879223Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9d8c2d40>} 2025-05-07T20:32:05.6880592Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.6881690Z context = 2025-05-07T20:32:05.6881987Z 2025-05-07T20:32:05.6882161Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.6882690Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.6883168Z module_map=module_map) 2025-05-07T20:32:05.6883539Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.6883900Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:05.6884168Z E ^ 2025-05-07T20:32:05.6884643Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.6885105Z 2025-05-07T20:32:05.6885530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.6886050Z 2025-05-07T20:32:05.6886161Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.6886578Z self=, 2025-05-07T20:32:05.6886984Z T=128, 2025-05-07T20:32:05.6887257Z D=5120, 2025-05-07T20:32:05.6887452Z scale_ub=None, 2025-05-07T20:32:05.6887673Z contiguous=True, 2025-05-07T20:32:05.6887901Z compiled=True, 2025-05-07T20:32:05.6888117Z ) 2025-05-07T20:32:06.1569375Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:06.1570468Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:06.1571830Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:06.1573292Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:06.1574693Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:06.1576108Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:06.1577447Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:06.1578941Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:06.1580398Z W0507 20:32:06.153000 87377 
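For the compiled=True examples, fn() wraps the op in torch.compile before calling it, which is why Dynamo's triton_kernel_wrap warnings precede each failure: the kernel is analyzed during tracing, then compiled again (and fails again) at launch. A minimal sketch of that dispatch (run_op is illustrative, not the test's helper):

    import torch

    def run_op(op, *args, compiled: bool = False):
        # Mirrors the test's fn(): optionally route through torch.compile
        # so Dynamo traces the op; user-defined Triton kernels inside it
        # are inspected via triton_kernel_wrap during tracing.
        if compiled:
            op = torch.compile(op)
        return op(*args)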
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:06.1581832Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:06.1583072Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:06.1584308Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:06.1585369Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:06.1586413Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:06.1587657Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:06.1588964Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:06.1590106Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:06.1591167Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:06.1592480Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:06.1593861Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:06.1594937Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:06.1595865Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:06.1596620Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:06.1597669Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:06.3209194Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:06.3210875Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:06.3212236Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:06.3213692Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:06.3215113Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:06.3216681Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:06.3218076Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:06.3219480Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:06.3220936Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:06.3222251Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:06.3223497Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:06.3224729Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:06.3225785Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:06.3226936Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:06.3228186Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:06.3229498Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:06.3230640Z W0507 
20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:06.3231759Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:06.3232964Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:06.3234352Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:06.3235433Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:06.3236363Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:06.3237110Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:06.3238154Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.0931706Z self = 2025-05-07T20:32:07.0932475Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:07.0932872Z 2025-05-07T20:32:07.0932958Z @given( 2025-05-07T20:32:07.0933202Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:07.0933532Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:07.0933853Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:07.0934190Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:07.0934530Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:07.0934822Z ) 2025-05-07T20:32:07.0935186Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:07.0935638Z def test_silu_mul_quant( 2025-05-07T20:32:07.0935887Z self, 2025-05-07T20:32:07.0936086Z T: int, 2025-05-07T20:32:07.0936288Z D: int, 2025-05-07T20:32:07.0936518Z scale_ub: Optional[float], 2025-05-07T20:32:07.0936792Z contiguous: bool, 2025-05-07T20:32:07.0937043Z compiled: bool, 2025-05-07T20:32:07.0937273Z ) -> None: 2025-05-07T20:32:07.0937491Z torch.manual_seed(2025) 2025-05-07T20:32:07.0937739Z 2025-05-07T20:32:07.0938151Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:07.0938499Z 2025-05-07T20:32:07.0938696Z x_sign = torch.sign(x) 2025-05-07T20:32:07.0938998Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:07.0939317Z x = x_sign * x_clamp 2025-05-07T20:32:07.0939565Z x0 = x[:, :D] 2025-05-07T20:32:07.0939783Z x1 = x[:, D:] 2025-05-07T20:32:07.0939996Z 2025-05-07T20:32:07.0940193Z if contiguous: 2025-05-07T20:32:07.0940429Z x0 = x0.contiguous() 2025-05-07T20:32:07.0940720Z x1 = x1.contiguous() 2025-05-07T20:32:07.0941154Z 2025-05-07T20:32:07.0941357Z if scale_ub is not None: 2025-05-07T20:32:07.0941639Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:07.0941991Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:07.0942297Z ) 2025-05-07T20:32:07.0942499Z else: 2025-05-07T20:32:07.0942719Z scale_ub_tensor = None 
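The listing continues below toward the comparison step: both the fused and reference outputs are dequantized with their row scales and compared in fp32. This section of the log never reaches that assertion, but it would plausibly look like the following sketch (the use of assert_close and the tolerances are assumptions, not the test's actual check):

    import torch

    def dequant_and_compare(y_fp8, y_scale, y_fp8_ref, y_scale_ref):
        # Both paths dequantize as fp8 * row_scale, then compare in fp32.
        y = y_fp8.to(torch.float32) * y_scale[:, None]
        y_ref = y_fp8_ref.to(torch.float32) * y_scale_ref[:, None]
        torch.testing.assert_close(y, y_ref, atol=1e-1, rtol=5e-2)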
2025-05-07T20:32:07.0942980Z 2025-05-07T20:32:07.0943219Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.0943535Z op = silu_mul_quant 2025-05-07T20:32:07.0943787Z if compiled: 2025-05-07T20:32:07.0944043Z op = torch.compile(op) 2025-05-07T20:32:07.0944350Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.0944630Z 2025-05-07T20:32:07.0944824Z y_fp8, y_scale = fn() 2025-05-07T20:32:07.0945118Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:07.0945417Z 2025-05-07T20:32:07.0945667Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.0946010Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:07.0946316Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:07.0946632Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:07.0946997Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:07.0947312Z 2025-05-07T20:32:07.0947516Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:07.0947720Z 2025-05-07T20:32:07.0947826Z moe/activation_test.py:126: 2025-05-07T20:32:07.0948133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.0948471Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:07.0948802Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:07.0949606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:07.0950375Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:07.0950935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:07.0951775Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:07.0952486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:07.0953218Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:07.0953977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:07.0954743Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:07.0955489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:07.0956330Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:07.0956939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:07.0957470Z fn() 2025-05-07T20:32:07.0957991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:07.0958579Z self.fn.run( 2025-05-07T20:32:07.0959062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:07.0959603Z kernel = self.compile( 2025-05-07T20:32:07.0960159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:07.0960828Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.0961231Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.0961458Z 2025-05-07T20:32:07.0961804Z self = 2025-05-07T20:32:07.0962916Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:07.0964317Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9d8c2e60>} 2025-05-07T20:32:07.0965690Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:07.0966738Z context = 2025-05-07T20:32:07.0967033Z 2025-05-07T20:32:07.0967214Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:07.0967741Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.0968231Z module_map=module_map) 2025-05-07T20:32:07.0968609Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.0968976Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:07.0969251Z E ^ 2025-05-07T20:32:07.0969728Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.0970185Z 2025-05-07T20:32:07.0970616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:07.0971160Z 2025-05-07T20:32:07.0971282Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:07.0971716Z self=, 2025-05-07T20:32:07.0972129Z T=4096, 2025-05-07T20:32:07.0972329Z D=5120, 2025-05-07T20:32:07.0972523Z scale_ub=None, 2025-05-07T20:32:07.0972747Z contiguous=True, 2025-05-07T20:32:07.0972982Z compiled=True, 2025-05-07T20:32:07.0973312Z ) 2025-05-07T20:32:07.5693406Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:07.5695668Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:07.5698312Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:07.5700995Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:07.5702631Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:07.5704048Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.5705389Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:07.5706797Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.5708414Z W0507 20:32:07.565000 87377 
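The reference-path traceback above also passes through Triton's autotuner: _kernel_quantize_fp8_row is autotuned, so every pruned config is compiled and benchmarked inside do_bench (quantiles 0.5/0.2/0.8), and the architecture error therefore surfaces from _bench rather than at the call site. A toy sketch of that decoration (kernel body and configs are illustrative, not FBGEMM's):

    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[triton.Config({"BLOCK": 128}), triton.Config({"BLOCK": 256})],
        key=["n"],
    )
    @triton.jit
    def _double_rows(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
        # Each config is compiled and timed on first launch; a dtype the
        # GPU cannot lower fails here, inside the autotuner's benchmark.
        pid = tl.program_id(0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        tl.store(out_ptr + offs, x * 2.0, mask=mask)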
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:07.5709699Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:07.5710942Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:07.5712179Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:07.5713239Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:07.5714281Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 2025-05-07T20:32:07.5715524Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:07.5716829Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:07.5717968Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:07.5719030Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:07.5720235Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:07.5721779Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:07.5722855Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.5723781Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.5724532Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:07.5725576Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.7338806Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:07.7339903Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:07.7341296Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:07.7342772Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:07.7344350Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:07.7345767Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.7347099Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:07.7348501Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.7349944Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:07.7351212Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:07.7352501Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:07.7353721Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:07.7354772Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:07.7356051Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 2025-05-07T20:32:07.7357293Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:07.7358722Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:07.7359854Z W0507 
20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:07.7360916Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:07.7362114Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:07.7363495Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:07.7364564Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.7365491Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.7366243Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:07.7367419Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.3217570Z self = 2025-05-07T20:32:08.3218255Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:08.3218546Z 2025-05-07T20:32:08.3226220Z @given( 2025-05-07T20:32:08.3226533Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.3226874Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.3227209Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.3227559Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.3227893Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.3228192Z ) 2025-05-07T20:32:08.3228563Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.3229020Z def test_silu_mul_quant( 2025-05-07T20:32:08.3229277Z self, 2025-05-07T20:32:08.3229501Z T: int, 2025-05-07T20:32:08.3229698Z D: int, 2025-05-07T20:32:08.3229932Z scale_ub: Optional[float], 2025-05-07T20:32:08.3230220Z contiguous: bool, 2025-05-07T20:32:08.3230473Z compiled: bool, 2025-05-07T20:32:08.3230714Z ) -> None: 2025-05-07T20:32:08.3230950Z torch.manual_seed(2025) 2025-05-07T20:32:08.3231225Z 2025-05-07T20:32:08.3231540Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.3231905Z 2025-05-07T20:32:08.3232112Z x_sign = torch.sign(x) 2025-05-07T20:32:08.3232424Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.3232753Z x = x_sign * x_clamp 2025-05-07T20:32:08.3233006Z x0 = x[:, :D] 2025-05-07T20:32:08.3233226Z x1 = x[:, D:] 2025-05-07T20:32:08.3233453Z 2025-05-07T20:32:08.3233655Z if contiguous: 2025-05-07T20:32:08.3233893Z x0 = x0.contiguous() 2025-05-07T20:32:08.3234174Z x1 = x1.contiguous() 2025-05-07T20:32:08.3234431Z 2025-05-07T20:32:08.3234630Z if scale_ub is not None: 2025-05-07T20:32:08.3234917Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.3235654Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.3235969Z ) 2025-05-07T20:32:08.3236202Z else: 2025-05-07T20:32:08.3236429Z scale_ub_tensor = None 
2025-05-07T20:32:08.3236696Z 2025-05-07T20:32:08.3236933Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.3237264Z op = silu_mul_quant 2025-05-07T20:32:08.3237526Z if compiled: 2025-05-07T20:32:08.3237776Z op = torch.compile(op) 2025-05-07T20:32:08.3238090Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.3238381Z 2025-05-07T20:32:08.3238582Z y_fp8, y_scale = fn() 2025-05-07T20:32:08.3238884Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:08.3239185Z 2025-05-07T20:32:08.3239444Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.3239784Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:08.3240099Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:08.3240426Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:08.3240793Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:08.3241165Z 2025-05-07T20:32:08.3241382Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:08.3241585Z 2025-05-07T20:32:08.3241692Z moe/activation_test.py:126: 2025-05-07T20:32:08.3242006Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.3242349Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:08.3242695Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:08.3243677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:08.3244473Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:08.3245036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.3245751Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.3246470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:08.3247217Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:08.3247985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:08.3248755Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:08.3249505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:08.3250174Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:08.3250795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:08.3251385Z fn() 2025-05-07T20:32:08.3251911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:08.3252502Z self.fn.run( 2025-05-07T20:32:08.3252992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.3253540Z kernel = self.compile( 2025-05-07T20:32:08.3254100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.3254768Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.3255178Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.3255410Z 2025-05-07T20:32:08.3256164Z self = 2025-05-07T20:32:08.3257279Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.3258918Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9d456320>} 2025-05-07T20:32:08.3260311Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.3261424Z context = 2025-05-07T20:32:08.3261722Z 2025-05-07T20:32:08.3261907Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.3262438Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.3262929Z module_map=module_map) 2025-05-07T20:32:08.3263311Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.3263678Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:08.3263955Z E ^ 2025-05-07T20:32:08.3264443Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.3264905Z 2025-05-07T20:32:08.3265346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.3265872Z 2025-05-07T20:32:08.3265982Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.3266414Z self=, 2025-05-07T20:32:08.3266834Z T=16384, 2025-05-07T20:32:08.3267157Z D=5120, 2025-05-07T20:32:08.3267359Z scale_ub=None, 2025-05-07T20:32:08.3267585Z contiguous=True, 2025-05-07T20:32:08.3267816Z compiled=True, 2025-05-07T20:32:08.3268032Z ) 2025-05-07T20:32:08.3651717Z W0507 20:32:08.363000 87377 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:08.3652999Z W0507 20:32:08.363000 87377 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:08.3654370Z W0507 20:32:08.363000 87377 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:08.3655379Z W0507 20:32:08.363000 87377 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:08.3656755Z W0507 20:32:08.363000 87377 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 
2025-05-07T20:32:08.4679172Z self = 2025-05-07T20:32:08.4679750Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:08.4680038Z 2025-05-07T20:32:08.4680129Z @given( 2025-05-07T20:32:08.4680369Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.4680699Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.4681022Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.4681362Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.4681711Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.4682010Z ) 2025-05-07T20:32:08.4682379Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.4682855Z def test_silu_mul_quant( 2025-05-07T20:32:08.4683112Z self, 2025-05-07T20:32:08.4683324Z T: int, 2025-05-07T20:32:08.4683861Z D: int, 2025-05-07T20:32:08.4684097Z scale_ub: Optional[float], 2025-05-07T20:32:08.4684386Z contiguous: bool, 2025-05-07T20:32:08.4684628Z compiled: bool, 2025-05-07T20:32:08.4684864Z ) -> None: 2025-05-07T20:32:08.4685095Z torch.manual_seed(2025) 2025-05-07T20:32:08.4685342Z 2025-05-07T20:32:08.4685638Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.4686001Z 2025-05-07T20:32:08.4686199Z x_sign = torch.sign(x) 2025-05-07T20:32:08.4686507Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.4686835Z x = x_sign * x_clamp 2025-05-07T20:32:08.4687091Z x0 = x[:, :D] 2025-05-07T20:32:08.4687314Z x1 = x[:, D:] 2025-05-07T20:32:08.4687535Z 2025-05-07T20:32:08.4687742Z if contiguous: 2025-05-07T20:32:08.4687986Z x0 = x0.contiguous() 2025-05-07T20:32:08.4688265Z x1 = x1.contiguous() 2025-05-07T20:32:08.4688519Z 2025-05-07T20:32:08.4688722Z if scale_ub is not None: 2025-05-07T20:32:08.4689005Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.4689355Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.4689672Z ) 2025-05-07T20:32:08.4689877Z else: 2025-05-07T20:32:08.4690101Z scale_ub_tensor = None 2025-05-07T20:32:08.4690357Z 2025-05-07T20:32:08.4690604Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.4690935Z op = silu_mul_quant 2025-05-07T20:32:08.4691232Z if compiled: 2025-05-07T20:32:08.4691500Z op = torch.compile(op) 2025-05-07T20:32:08.4691811Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.4692093Z 2025-05-07T20:32:08.4692300Z y_fp8, y_scale = fn() 2025-05-07T20:32:08.4692751Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:08.4693059Z 2025-05-07T20:32:08.4693306Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.4693664Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:08.4693967Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:08.4694290Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:08.4694665Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:08.4694994Z 2025-05-07T20:32:08.4695203Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:08.4695414Z 2025-05-07T20:32:08.4695523Z moe/activation_test.py:126: 2025-05-07T20:32:08.4695835Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.4696184Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:08.4696521Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:08.4697346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:08.4698246Z 
_kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:08.4698816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.4699526Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.4700244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:08.4700998Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:08.4701772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:08.4702555Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:08.4703322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:08.4703992Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:08.4704698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:08.4705243Z fn() 2025-05-07T20:32:08.4705781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:08.4706378Z self.fn.run( 2025-05-07T20:32:08.4706871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.4707425Z kernel = self.compile( 2025-05-07T20:32:08.4707990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.4708660Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.4709077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.4709313Z 2025-05-07T20:32:08.4709539Z self = 2025-05-07T20:32:08.4710669Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.4712106Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9d8c25f0>} 2025-05-07T20:32:08.4713504Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.4714691Z context = 2025-05-07T20:32:08.4714989Z 2025-05-07T20:32:08.4715178Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.4715722Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.4716228Z module_map=module_map) 2025-05-07T20:32:08.4716616Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.4716994Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:08.4717271Z E ^ 2025-05-07T20:32:08.4717757Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.4718228Z 2025-05-07T20:32:08.4718658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.4719188Z 2025-05-07T20:32:08.4719303Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.4719739Z self=, 2025-05-07T20:32:08.4720158Z T=1, 2025-05-07T20:32:08.4720355Z D=5120, 2025-05-07T20:32:08.4720556Z scale_ub=1200.0, 2025-05-07T20:32:08.4720799Z contiguous=True, 2025-05-07T20:32:08.4721060Z compiled=True, 2025-05-07T20:32:08.4721296Z ) 2025-05-07T20:32:08.6165965Z self = 2025-05-07T20:32:08.6166543Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:08.6166818Z 2025-05-07T20:32:08.6166906Z @given( 2025-05-07T20:32:08.6167149Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.6167472Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.6167788Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.6168123Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.6168464Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.6168760Z ) 2025-05-07T20:32:08.6169130Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.6169584Z def test_silu_mul_quant( 2025-05-07T20:32:08.6170068Z self, 2025-05-07T20:32:08.6170270Z T: int, 2025-05-07T20:32:08.6170470Z D: int, 2025-05-07T20:32:08.6170697Z scale_ub: Optional[float], 2025-05-07T20:32:08.6170983Z contiguous: bool, 2025-05-07T20:32:08.6171227Z compiled: bool, 2025-05-07T20:32:08.6171463Z ) -> None: 2025-05-07T20:32:08.6171721Z torch.manual_seed(2025) 2025-05-07T20:32:08.6171988Z 2025-05-07T20:32:08.6172398Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.6172846Z 2025-05-07T20:32:08.6173043Z x_sign = torch.sign(x) 2025-05-07T20:32:08.6173346Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.6173664Z x = x_sign * x_clamp 2025-05-07T20:32:08.6173903Z x0 = x[:, :D] 2025-05-07T20:32:08.6174131Z x1 = x[:, D:] 2025-05-07T20:32:08.6174346Z 2025-05-07T20:32:08.6174530Z if contiguous: 2025-05-07T20:32:08.6174768Z x0 = x0.contiguous() 2025-05-07T20:32:08.6175038Z x1 = x1.contiguous() 2025-05-07T20:32:08.6175281Z 2025-05-07T20:32:08.6175480Z if scale_ub is not None: 2025-05-07T20:32:08.6175764Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.6176114Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.6176429Z ) 2025-05-07T20:32:08.6176625Z else: 2025-05-07T20:32:08.6176839Z scale_ub_tensor = None 2025-05-07T20:32:08.6177092Z 2025-05-07T20:32:08.6177332Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.6177660Z op = silu_mul_quant 2025-05-07T20:32:08.6177911Z if compiled: 2025-05-07T20:32:08.6178274Z op = torch.compile(op) 2025-05-07T20:32:08.6178586Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.6179070Z 2025-05-07T20:32:08.6179273Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.6179444Z 2025-05-07T20:32:08.6179551Z moe/activation_test.py:117: 2025-05-07T20:32:08.6179855Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.6180199Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.6180494Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.6181073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:08.6181650Z return fn(*args, **kwargs) 
2025-05-07T20:32:08.6182326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:08.6183036Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.6183582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.6184289Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.6184970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.6185519Z kernel = self.compile( 2025-05-07T20:32:08.6186069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.6186743Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.6187144Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.6187374Z 2025-05-07T20:32:08.6187591Z self = 2025-05-07T20:32:08.6188708Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.6190126Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9cff05e0>} 2025-05-07T20:32:08.6191645Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.6192699Z context = 2025-05-07T20:32:08.6192995Z 2025-05-07T20:32:08.6193170Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.6193700Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.6194184Z module_map=module_map) 2025-05-07T20:32:08.6194562Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.6194921Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.6195184Z E ^ 2025-05-07T20:32:08.6195664Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.6196132Z 2025-05-07T20:32:08.6196564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.6197089Z 2025-05-07T20:32:08.6197195Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.6197618Z self=, 2025-05-07T20:32:08.6198026Z T=1, 2025-05-07T20:32:08.6198209Z D=5120, 2025-05-07T20:32:08.6198405Z scale_ub=None, 2025-05-07T20:32:08.6198630Z contiguous=False, 2025-05-07T20:32:08.6198860Z compiled=True, 2025-05-07T20:32:08.6199065Z ) 2025-05-07T20:32:08.6877380Z self = 2025-05-07T20:32:08.6878106Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:08.6878385Z 2025-05-07T20:32:08.6878470Z @given( 2025-05-07T20:32:08.6878706Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.6879038Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.6879356Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.6879689Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.6880031Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.6880321Z ) 2025-05-07T20:32:08.6880681Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.6881138Z def test_silu_mul_quant( 2025-05-07T20:32:08.6881389Z self, 2025-05-07T20:32:08.6881614Z T: int, 2025-05-07T20:32:08.6881835Z D: int, 2025-05-07T20:32:08.6882054Z scale_ub: Optional[float], 2025-05-07T20:32:08.6882332Z contiguous: bool, 2025-05-07T20:32:08.6882568Z compiled: bool, 2025-05-07T20:32:08.6882805Z ) -> None: 2025-05-07T20:32:08.6883028Z torch.manual_seed(2025) 2025-05-07T20:32:08.6883267Z 2025-05-07T20:32:08.6883546Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.6883898Z 2025-05-07T20:32:08.6884092Z x_sign = torch.sign(x) 2025-05-07T20:32:08.6884386Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.6884699Z x = x_sign * x_clamp 2025-05-07T20:32:08.6884939Z x0 = x[:, :D] 2025-05-07T20:32:08.6885158Z x1 = x[:, D:] 2025-05-07T20:32:08.6885370Z 2025-05-07T20:32:08.6885563Z if contiguous: 2025-05-07T20:32:08.6885798Z x0 = x0.contiguous() 2025-05-07T20:32:08.6886064Z x1 = x1.contiguous() 2025-05-07T20:32:08.6886313Z 2025-05-07T20:32:08.6886508Z if scale_ub is not None: 2025-05-07T20:32:08.6886784Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.6887131Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.6887448Z ) 2025-05-07T20:32:08.6887641Z else: 2025-05-07T20:32:08.6887856Z scale_ub_tensor = None 2025-05-07T20:32:08.6888233Z 2025-05-07T20:32:08.6888468Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.6888785Z op = silu_mul_quant 2025-05-07T20:32:08.6889036Z if compiled: 2025-05-07T20:32:08.6889287Z op = torch.compile(op) 2025-05-07T20:32:08.6889594Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.6889865Z 2025-05-07T20:32:08.6890061Z y_fp8, y_scale = fn() 2025-05-07T20:32:08.6890354Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:08.6890642Z 2025-05-07T20:32:08.6890884Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.6891223Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:08.6891525Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:08.6891896Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:08.6892264Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:08.6892583Z 2025-05-07T20:32:08.6892784Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:08.6892988Z 2025-05-07T20:32:08.6893091Z moe/activation_test.py:126: 2025-05-07T20:32:08.6893393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.6893731Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:08.6894069Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:08.6894875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:08.6895648Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:08.6896204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.6896980Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.6897688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:08.6898582Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:08.6899369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:08.6900133Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:08.6900876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:08.6901525Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:08.6902135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:08.6902661Z fn() 2025-05-07T20:32:08.6903183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:08.6903774Z self.fn.run( 2025-05-07T20:32:08.6904252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.6904796Z kernel = self.compile( 2025-05-07T20:32:08.6905349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.6906011Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.6906410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.6906639Z 2025-05-07T20:32:08.6906858Z self = 2025-05-07T20:32:08.6907966Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.6909371Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fac9cff1090>} 2025-05-07T20:32:08.6910837Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.6911942Z context = 2025-05-07T20:32:08.6912235Z 2025-05-07T20:32:08.6912408Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.6912936Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.6913412Z module_map=module_map) 2025-05-07T20:32:08.6913791Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.6914169Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:08.6914438Z E ^ 2025-05-07T20:32:08.6914922Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.6915395Z 2025-05-07T20:32:08.6923099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.6923664Z 2025-05-07T20:32:08.6923778Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.6924208Z self=, 2025-05-07T20:32:08.6924615Z T=1, 2025-05-07T20:32:08.6924808Z D=5120, 2025-05-07T20:32:08.6924999Z scale_ub=None, 2025-05-07T20:32:08.6925217Z contiguous=True, 2025-05-07T20:32:08.6925445Z compiled=False, 2025-05-07T20:32:08.6925648Z ) 2025-05-07T20:32:09.0306529Z self = 2025-05-07T20:32:09.0307640Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:09.0308194Z 2025-05-07T20:32:09.0308368Z @given( 2025-05-07T20:32:09.0308847Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.0309500Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.0310137Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.0310819Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.0311486Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.0311884Z ) 2025-05-07T20:32:09.0312257Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.0312714Z def test_silu_mul_quant( 2025-05-07T20:32:09.0312974Z self, 2025-05-07T20:32:09.0313181Z T: int, 2025-05-07T20:32:09.0313384Z D: int, 2025-05-07T20:32:09.0313620Z scale_ub: Optional[float], 2025-05-07T20:32:09.0313915Z contiguous: bool, 2025-05-07T20:32:09.0314164Z compiled: bool, 2025-05-07T20:32:09.0314406Z ) -> None: 2025-05-07T20:32:09.0314645Z torch.manual_seed(2025) 2025-05-07T20:32:09.0314894Z 2025-05-07T20:32:09.0315183Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.0315544Z 2025-05-07T20:32:09.0315750Z x_sign = torch.sign(x) 2025-05-07T20:32:09.0316054Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.0316382Z x = x_sign * x_clamp 2025-05-07T20:32:09.0316632Z x0 = x[:, :D] 2025-05-07T20:32:09.0316853Z x1 = x[:, D:] 2025-05-07T20:32:09.0317074Z 2025-05-07T20:32:09.0317274Z if contiguous: 2025-05-07T20:32:09.0317513Z x0 = x0.contiguous() 2025-05-07T20:32:09.0317787Z x1 = x1.contiguous() 2025-05-07T20:32:09.0318041Z 2025-05-07T20:32:09.0318240Z if scale_ub is not None: 2025-05-07T20:32:09.0318536Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.0318891Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.0319211Z ) 2025-05-07T20:32:09.0319547Z else: 2025-05-07T20:32:09.0319771Z scale_ub_tensor = None 2025-05-07T20:32:09.0320039Z 2025-05-07T20:32:09.0320279Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.0320607Z op = silu_mul_quant 2025-05-07T20:32:09.0320866Z if compiled: 2025-05-07T20:32:09.0321134Z 
op = torch.compile(op) 2025-05-07T20:32:09.0321485Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0321769Z 2025-05-07T20:32:09.0321973Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.0322143Z 2025-05-07T20:32:09.0322248Z moe/activation_test.py:117: 2025-05-07T20:32:09.0322555Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0322895Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.0323189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0323914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.0324640Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.0325207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.0325910Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.0326600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.0327162Z kernel = self.compile( 2025-05-07T20:32:09.0327726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.0328407Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.0328902Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0329141Z 2025-05-07T20:32:09.0329364Z self = 2025-05-07T20:32:09.0330479Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.0331953Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9cff3880>} 2025-05-07T20:32:09.0333335Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.0334387Z context = 2025-05-07T20:32:09.0334689Z 2025-05-07T20:32:09.0334867Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.0335396Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.0335887Z module_map=module_map) 2025-05-07T20:32:09.0336266Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.0336631Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.0336915Z E ^ 2025-05-07T20:32:09.0337556Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.0338091Z 2025-05-07T20:32:09.0338529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.0339053Z 2025-05-07T20:32:09.0339160Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.0339595Z self=, 2025-05-07T20:32:09.0340007Z T=128, 2025-05-07T20:32:09.0340201Z D=5120, 2025-05-07T20:32:09.0340404Z scale_ub=None, 2025-05-07T20:32:09.0340729Z contiguous=False, 2025-05-07T20:32:09.0340956Z compiled=True, 2025-05-07T20:32:09.0341170Z ) 2025-05-07T20:32:09.0341501Z self = 2025-05-07T20:32:09.0342054Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:09.0342333Z 2025-05-07T20:32:09.0342413Z @given( 2025-05-07T20:32:09.0342656Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.0342988Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.0343299Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.0343639Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.0343982Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.0344271Z ) 2025-05-07T20:32:09.0344635Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.0345092Z def test_silu_mul_quant( 2025-05-07T20:32:09.0345340Z self, 2025-05-07T20:32:09.0345553Z T: int, 2025-05-07T20:32:09.0345764Z D: int, 2025-05-07T20:32:09.0345990Z scale_ub: Optional[float], 2025-05-07T20:32:09.0346267Z contiguous: bool, 2025-05-07T20:32:09.0346515Z compiled: bool, 2025-05-07T20:32:09.0346748Z ) -> None: 2025-05-07T20:32:09.0346968Z torch.manual_seed(2025) 2025-05-07T20:32:09.0347216Z 2025-05-07T20:32:09.0347501Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.0347845Z 2025-05-07T20:32:09.0348050Z x_sign = torch.sign(x) 2025-05-07T20:32:09.0348355Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.0348672Z x = x_sign * x_clamp 2025-05-07T20:32:09.0348932Z x0 = x[:, :D] 2025-05-07T20:32:09.0349163Z x1 = x[:, D:] 2025-05-07T20:32:09.0349460Z 2025-05-07T20:32:09.0349662Z if contiguous: 2025-05-07T20:32:09.0349904Z x0 = x0.contiguous() 2025-05-07T20:32:09.0350173Z x1 = x1.contiguous() 2025-05-07T20:32:09.0350424Z 2025-05-07T20:32:09.0350632Z if scale_ub is not None: 2025-05-07T20:32:09.0350912Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.0351300Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.0351631Z ) 2025-05-07T20:32:09.0351835Z else: 2025-05-07T20:32:09.0352057Z scale_ub_tensor = None 2025-05-07T20:32:09.0352319Z 2025-05-07T20:32:09.0352568Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.0352884Z op = silu_mul_quant 2025-05-07T20:32:09.0353144Z if compiled: 2025-05-07T20:32:09.0353399Z op = torch.compile(op) 2025-05-07T20:32:09.0353699Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0353981Z 2025-05-07T20:32:09.0354191Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.0354359Z 2025-05-07T20:32:09.0354462Z moe/activation_test.py:117: 2025-05-07T20:32:09.0354770Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0355104Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.0355396Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0356530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:09.0357357Z return fn(*args, **kwargs) 
2025-05-07T20:32:09.0358335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.0359347Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.0360147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.0360918Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.0361625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.0362344Z kernel = self.compile( 2025-05-07T20:32:09.0362904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.0363576Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.0363975Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0364210Z 2025-05-07T20:32:09.0364425Z self = 2025-05-07T20:32:09.0365528Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.0366942Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9cff3eb0>} 2025-05-07T20:32:09.0368324Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.0369368Z context = 2025-05-07T20:32:09.0369670Z 2025-05-07T20:32:09.0369843Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.0370379Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.0370864Z module_map=module_map) 2025-05-07T20:32:09.0371242Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.0371606Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.0372031Z E ^ 2025-05-07T20:32:09.0372520Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.0372992Z 2025-05-07T20:32:09.0373417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.0373946Z 2025-05-07T20:32:09.0374054Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.0374483Z self=, 2025-05-07T20:32:09.0374889Z T=128, 2025-05-07T20:32:09.0375089Z D=7168, 2025-05-07T20:32:09.0375296Z scale_ub=1200.0, 2025-05-07T20:32:09.0375523Z contiguous=False, 2025-05-07T20:32:09.0375757Z compiled=False, 2025-05-07T20:32:09.0375969Z ) 2025-05-07T20:32:09.1634826Z self = 2025-05-07T20:32:09.1635458Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:09.1635738Z 2025-05-07T20:32:09.1635818Z @given( 2025-05-07T20:32:09.1636060Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.1636386Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.1636714Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.1637049Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.1637391Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.1637692Z ) 2025-05-07T20:32:09.1638053Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.1638499Z def test_silu_mul_quant( 2025-05-07T20:32:09.1638748Z self, 2025-05-07T20:32:09.1638953Z T: int, 2025-05-07T20:32:09.1639150Z D: int, 2025-05-07T20:32:09.1639376Z scale_ub: Optional[float], 2025-05-07T20:32:09.1639662Z contiguous: bool, 2025-05-07T20:32:09.1639905Z compiled: bool, 2025-05-07T20:32:09.1640147Z ) -> None: 2025-05-07T20:32:09.1640373Z torch.manual_seed(2025) 2025-05-07T20:32:09.1640616Z 2025-05-07T20:32:09.1640902Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.1641456Z 2025-05-07T20:32:09.1641650Z x_sign = torch.sign(x) 2025-05-07T20:32:09.1641971Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.1642309Z x = x_sign * x_clamp 2025-05-07T20:32:09.1642545Z x0 = x[:, :D] 2025-05-07T20:32:09.1642764Z x1 = x[:, D:] 2025-05-07T20:32:09.1642975Z 2025-05-07T20:32:09.1643164Z if contiguous: 2025-05-07T20:32:09.1643394Z x0 = x0.contiguous() 2025-05-07T20:32:09.1643654Z x1 = x1.contiguous() 2025-05-07T20:32:09.1643895Z 2025-05-07T20:32:09.1644088Z if scale_ub is not None: 2025-05-07T20:32:09.1644367Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.1644709Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.1645018Z ) 2025-05-07T20:32:09.1645217Z else: 2025-05-07T20:32:09.1645433Z scale_ub_tensor = None 2025-05-07T20:32:09.1645691Z 2025-05-07T20:32:09.1645930Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.1646249Z op = silu_mul_quant 2025-05-07T20:32:09.1646497Z if compiled: 2025-05-07T20:32:09.1646750Z op = torch.compile(op) 2025-05-07T20:32:09.1647053Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.1647325Z 2025-05-07T20:32:09.1647521Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.1647691Z 2025-05-07T20:32:09.1647794Z moe/activation_test.py:117: 2025-05-07T20:32:09.1648090Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.1648413Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.1648699Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.1649518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.1650218Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.1650772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.1651518Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.1652189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.1652723Z kernel = self.compile( 2025-05-07T20:32:09.1653275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.1654029Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.1654478Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.1654712Z 2025-05-07T20:32:09.1654931Z self = 2025-05-07T20:32:09.1656315Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.1657736Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9cff0d30>} 2025-05-07T20:32:09.1659180Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.1660223Z context = 2025-05-07T20:32:09.1660525Z 2025-05-07T20:32:09.1660694Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.1661238Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.1661748Z module_map=module_map) 2025-05-07T20:32:09.1662287Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.1662647Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.1662910Z E ^ 2025-05-07T20:32:09.1663379Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.1663840Z 2025-05-07T20:32:09.1664263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.1664884Z 2025-05-07T20:32:09.1665040Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.1665618Z self=, 2025-05-07T20:32:09.1666023Z T=128, 2025-05-07T20:32:09.1666215Z D=5120, 2025-05-07T20:32:09.1666418Z scale_ub=None, 2025-05-07T20:32:09.1666632Z contiguous=False, 2025-05-07T20:32:09.1666861Z compiled=False, 2025-05-07T20:32:09.1667068Z ) 2025-05-07T20:32:09.1667395Z self = 2025-05-07T20:32:09.1667895Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:09.1668172Z 2025-05-07T20:32:09.1668250Z @given( 2025-05-07T20:32:09.1668483Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.1668796Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.1669106Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.1669440Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.1669772Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.1670062Z ) 2025-05-07T20:32:09.1670415Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.1670858Z def test_silu_mul_quant( 2025-05-07T20:32:09.1671254Z self, 2025-05-07T20:32:09.1671458Z T: int, 2025-05-07T20:32:09.1671653Z D: int, 2025-05-07T20:32:09.1671900Z scale_ub: Optional[float], 2025-05-07T20:32:09.1672205Z contiguous: bool, 2025-05-07T20:32:09.1672443Z compiled: bool, 2025-05-07T20:32:09.1672669Z ) -> None: 2025-05-07T20:32:09.1672893Z torch.manual_seed(2025) 2025-05-07T20:32:09.1673135Z 2025-05-07T20:32:09.1673409Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.1673751Z 2025-05-07T20:32:09.1673946Z x_sign = torch.sign(x) 2025-05-07T20:32:09.1674241Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.1674560Z x = x_sign * x_clamp 2025-05-07T20:32:09.1674803Z x0 = x[:, :D] 2025-05-07T20:32:09.1675014Z x1 = x[:, D:] 2025-05-07T20:32:09.1675226Z 2025-05-07T20:32:09.1675416Z if contiguous: 2025-05-07T20:32:09.1675652Z x0 = x0.contiguous() 2025-05-07T20:32:09.1675911Z x1 = x1.contiguous() 2025-05-07T20:32:09.1676162Z 2025-05-07T20:32:09.1676351Z if scale_ub is not None: 2025-05-07T20:32:09.1676636Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.1676977Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.1677282Z ) 2025-05-07T20:32:09.1677479Z else: 2025-05-07T20:32:09.1677691Z scale_ub_tensor = None 2025-05-07T20:32:09.1677948Z 2025-05-07T20:32:09.1678179Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.1678495Z op = silu_mul_quant 2025-05-07T20:32:09.1678748Z if compiled: 2025-05-07T20:32:09.1678994Z op = torch.compile(op) 2025-05-07T20:32:09.1679295Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.1679571Z 2025-05-07T20:32:09.1679761Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.1679932Z 2025-05-07T20:32:09.1680039Z moe/activation_test.py:117: 2025-05-07T20:32:09.1680333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.1680660Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.1681037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.1681789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.1682492Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.1683033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.1683724Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.1684395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.1684933Z kernel = self.compile( 2025-05-07T20:32:09.1685488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.1686157Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.1686565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.1686793Z 2025-05-07T20:32:09.1687005Z self = 2025-05-07T20:32:09.1688103Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.1689501Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9c9a6cb0>} 2025-05-07T20:32:09.1690953Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.1692054Z context = 2025-05-07T20:32:09.1692352Z 2025-05-07T20:32:09.1692521Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.1693056Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.1693532Z module_map=module_map) 2025-05-07T20:32:09.1693898Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.1694257Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.1694521Z E ^ 2025-05-07T20:32:09.1694991Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.1695446Z 2025-05-07T20:32:09.1695876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.1696397Z 2025-05-07T20:32:09.1696501Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.1696922Z self=, 2025-05-07T20:32:09.1697333Z T=128, 2025-05-07T20:32:09.1697517Z D=5120, 2025-05-07T20:32:09.1697711Z scale_ub=1200.0, 2025-05-07T20:32:09.1697945Z contiguous=True, 2025-05-07T20:32:09.1698270Z compiled=False, 2025-05-07T20:32:09.1698477Z ) 2025-05-07T20:32:09.3631115Z self = 2025-05-07T20:32:09.3631666Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:09.3631948Z 2025-05-07T20:32:09.3632029Z @given( 2025-05-07T20:32:09.3632270Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.3632588Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.3632894Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.3633239Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.3633574Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.3634081Z ) 2025-05-07T20:32:09.3634434Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.3634883Z def test_silu_mul_quant( 2025-05-07T20:32:09.3635128Z self, 2025-05-07T20:32:09.3635327Z T: int, 2025-05-07T20:32:09.3635529Z D: int, 2025-05-07T20:32:09.3635751Z scale_ub: Optional[float], 2025-05-07T20:32:09.3636022Z contiguous: bool, 2025-05-07T20:32:09.3636265Z compiled: bool, 2025-05-07T20:32:09.3636493Z ) -> None: 2025-05-07T20:32:09.3636707Z torch.manual_seed(2025) 2025-05-07T20:32:09.3636953Z 2025-05-07T20:32:09.3637234Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.3637575Z 2025-05-07T20:32:09.3637779Z x_sign = torch.sign(x) 2025-05-07T20:32:09.3638079Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.3638396Z x = x_sign * x_clamp 2025-05-07T20:32:09.3645343Z x0 = x[:, :D] 2025-05-07T20:32:09.3645609Z x1 = x[:, D:] 2025-05-07T20:32:09.3645836Z 2025-05-07T20:32:09.3646027Z if contiguous: 2025-05-07T20:32:09.3646287Z x0 = x0.contiguous() 2025-05-07T20:32:09.3646575Z x1 = x1.contiguous() 2025-05-07T20:32:09.3646831Z 2025-05-07T20:32:09.3647028Z if scale_ub is not None: 2025-05-07T20:32:09.3647304Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.3647652Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.3647961Z ) 2025-05-07T20:32:09.3648163Z else: 2025-05-07T20:32:09.3648382Z scale_ub_tensor = None 2025-05-07T20:32:09.3648630Z 2025-05-07T20:32:09.3648870Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.3649192Z op = silu_mul_quant 2025-05-07T20:32:09.3649602Z if compiled: 2025-05-07T20:32:09.3649862Z op = torch.compile(op) 2025-05-07T20:32:09.3650164Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.3650438Z 2025-05-07T20:32:09.3650637Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.3650805Z 2025-05-07T20:32:09.3650913Z moe/activation_test.py:117: 2025-05-07T20:32:09.3651206Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.3651544Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.3651869Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.3652597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.3653293Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.3653835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.3654534Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.3655203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.3656101Z kernel = self.compile( 2025-05-07T20:32:09.3656657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.3657327Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.3657723Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.3657953Z 2025-05-07T20:32:09.3658239Z self = 2025-05-07T20:32:09.3659342Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.3660746Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9c9a64d0>} 2025-05-07T20:32:09.3662280Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.3663322Z context = 2025-05-07T20:32:09.3663625Z 2025-05-07T20:32:09.3663794Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.3664324Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.3664802Z module_map=module_map) 2025-05-07T20:32:09.3665163Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.3665524Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.3665787Z E ^ 2025-05-07T20:32:09.3666253Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.3667138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:09.3667774Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:09.3668204Z self=,
2025-05-07T20:32:09.3668598Z T=1,
2025-05-07T20:32:09.3668783Z D=7168,
2025-05-07T20:32:09.3668977Z scale_ub=1200.0,
2025-05-07T20:32:09.3669195Z contiguous=True,
2025-05-07T20:32:09.3669423Z compiled=True,
2025-05-07T20:32:09.3669622Z )
2025-05-07T20:32:09.3698493Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:09.3698851Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:09.3699106Z E ^
2025-05-07T20:32:09.3699579Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.3700461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:09.3701088Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:09.3701521Z self=,
2025-05-07T20:32:09.3701978Z T=1,
2025-05-07T20:32:09.3702162Z D=7168,
2025-05-07T20:32:09.3702460Z scale_ub=1200.0,
2025-05-07T20:32:09.3702792Z contiguous=False,
2025-05-07T20:32:09.3703137Z compiled=True,
2025-05-07T20:32:09.3703424Z )
2025-05-07T20:32:09.5113453Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:09.5113896Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:09.5114158Z E ^
2025-05-07T20:32:09.5114634Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.5115530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:09.5116163Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:09.5116582Z self=,
2025-05-07T20:32:09.5116989Z T=1,
2025-05-07T20:32:09.5117181Z D=7168,
2025-05-07T20:32:09.5117376Z scale_ub=None,
2025-05-07T20:32:09.5117594Z contiguous=False,
2025-05-07T20:32:09.5117825Z compiled=True,
2025-05-07T20:32:09.5118030Z )
2025-05-07T20:32:09.6084713Z self =
2025-05-07T20:32:09.6085321Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:09.6097196Z y_fp8, y_scale = fn()
2025-05-07T20:32:09.6097491Z y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:09.6097787Z
2025-05-07T20:32:09.6098204Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:09.6098543Z x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:09.6098838Z x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:09.6099281Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:09.6099644Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:09.6099963Z
2025-05-07T20:32:09.6100171Z > y_fp8_ref,
y_scale_ref = ref_fn() 2025-05-07T20:32:09.6100371Z 2025-05-07T20:32:09.6100471Z moe/activation_test.py:126: 2025-05-07T20:32:09.6100770Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.6101107Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:09.6101440Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:09.6102298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:09.6103067Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:09.6103628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.6104326Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.6105028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:09.6105766Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:09.6106532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:09.6107286Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:09.6108026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:09.6108677Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:09.6109298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:09.6109818Z fn() 2025-05-07T20:32:09.6110339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:09.6111016Z self.fn.run( 2025-05-07T20:32:09.6111489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.6112055Z kernel = self.compile( 2025-05-07T20:32:09.6112629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.6113297Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.6113691Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.6113922Z 2025-05-07T20:32:09.6114135Z self = 2025-05-07T20:32:09.6115247Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.6116661Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7faaebbf4a60>}
2025-05-07T20:32:09.6118028Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:09.6119072Z context =
2025-05-07T20:32:09.6119368Z
2025-05-07T20:32:09.6119538Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:09.6120071Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:09.6120673Z module_map=module_map)
2025-05-07T20:32:09.6121044Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:09.6121448Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:09.6121729Z E ^
2025-05-07T20:32:09.6122200Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.6123081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
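Every Hypothesis example in this test fails the same way: the error is raised while Triton lowers the kernel to IR (make_ir / ast_to_ttir), before anything launches, because the fp8e4nv dtype (FP8 E4M3, torch.float8_e4m3fn) is only accepted by Triton on NVIDIA GPUs of compute capability 8.9 or newer, and pre-Ada GPUs such as the A10G (SM 8.6) used by g5 instances expose only fp8e4b15 and fp8e5. A minimal sketch of a capability guard such a test could use to skip cleanly on older GPUs; the helper name supports_fp8e4nv is hypothetical, not an FBGEMM or Triton API:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton accepts fp8e4nv (torch.float8_e4m3fn)
        # only on SM 8.9+ (Ada / Hopper). SM 8.6 (A10G) exposes just
        # fp8e4b15 and fp8e5, which is exactly the ValueError in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Usage sketch on the failing test:
    # @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    # def test_silu_mul_quant(self, ...) -> None: ...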
2025-05-07T20:32:09.6123714Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:09.6124132Z self=,
2025-05-07T20:32:09.6124537Z T=1,
2025-05-07T20:32:09.6124721Z D=5120,
2025-05-07T20:32:09.6124911Z scale_ub=1200.0,
2025-05-07T20:32:09.6125137Z contiguous=False,
2025-05-07T20:32:09.6125365Z compiled=True,
2025-05-07T20:32:09.6125573Z )
2025-05-07T20:32:09.9561382Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:09.9561749Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:09.9562045Z E ^
2025-05-07T20:32:09.9562532Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.9563425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:09.9564057Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:09.9564484Z self=,
2025-05-07T20:32:09.9564885Z T=1,
2025-05-07T20:32:09.9565077Z D=5120,
2025-05-07T20:32:09.9565276Z scale_ub=1200.0,
2025-05-07T20:32:09.9565501Z contiguous=False,
2025-05-07T20:32:09.9565731Z compiled=False,
2025-05-07T20:32:09.9565953Z )
2025-05-07T20:32:09.9593194Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:09.9593560Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:09.9593821Z E ^
2025-05-07T20:32:09.9594288Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.9595177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:09.9595809Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:09.9596294Z self=,
2025-05-07T20:32:09.9596699Z T=16384,
2025-05-07T20:32:09.9596900Z D=5120,
2025-05-07T20:32:09.9597098Z scale_ub=1200.0,
2025-05-07T20:32:09.9597324Z contiguous=False,
2025-05-07T20:32:09.9597554Z compiled=True,
2025-05-07T20:32:09.9597764Z )
2025-05-07T20:32:10.0639823Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:10.0640439Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:10.0640891Z E ^
2025-05-07T20:32:10.0641751Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.0643323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:10.0644550Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:10.0645280Z self=,
2025-05-07T20:32:10.0645982Z T=2048,
2025-05-07T20:32:10.0646302Z D=7168,
2025-05-07T20:32:10.0646628Z scale_ub=1200.0,
2025-05-07T20:32:10.0647007Z contiguous=False,
2025-05-07T20:32:10.0647399Z compiled=True,
2025-05-07T20:32:10.0647757Z )
2025-05-07T20:32:10.0680199Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:10.0680553Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:10.0680809Z E ^
2025-05-07T20:32:10.0681323Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.0682205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:10.1969175Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:10.1969668Z self=,
2025-05-07T20:32:10.1970082Z T=1,
2025-05-07T20:32:10.1970273Z D=5120,
2025-05-07T20:32:10.1970470Z scale_ub=None,
2025-05-07T20:32:10.1970683Z contiguous=False,
2025-05-07T20:32:10.1970919Z compiled=False,
2025-05-07T20:32:10.1971127Z )
2025-05-07T20:32:10.1998530Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:10.1998886Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:10.1999150Z E ^
2025-05-07T20:32:10.1999620Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.2000506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
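The one example above that got past fn() (T=1, D=7168, scale_ub=None, compiled=True) still failed in the eager reference path: triton_quantize_fp8_row autotunes _kernel_quantize_fp8_row, which hits the identical ValueError, so the fused kernel and the reference quantizer share a single root cause rather than two separate bugs. A standalone repro sketch of that root cause, assuming a CUDA build of Triton on a pre-SM-8.9 GPU; the kernel below is a toy, not FBGEMM's _fbgemm_silu_mul_quant or _kernel_quantize_fp8_row:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On SM 8.6 this cast aborts compilation during make_ir with
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    # Expected to raise triton.compiler.errors.CompilationError on an A10G:
    _cast_fp8e4nv[(4,)](x, y, 1024, BLOCK=256)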
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.1987624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.1988324Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.1989005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.1989550Z kernel = self.compile( 2025-05-07T20:32:10.1990099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.1990773Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.1991176Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.1991405Z 2025-05-07T20:32:10.1991622Z self = 2025-05-07T20:32:10.1992737Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.1994154Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaebfe85e0>} 2025-05-07T20:32:10.1995544Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.1996592Z context = 2025-05-07T20:32:10.1996885Z 2025-05-07T20:32:10.1997135Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.1997670Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.1998156Z module_map=module_map) 2025-05-07T20:32:10.1998530Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.1998886Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.1999150Z E ^ 2025-05-07T20:32:10.1999620Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.2000081Z 2025-05-07T20:32:10.2000506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.2001034Z 2025-05-07T20:32:10.2001140Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.2001594Z self=, 2025-05-07T20:32:10.2001997Z T=4096, 2025-05-07T20:32:10.2002184Z D=7168, 2025-05-07T20:32:10.2002374Z scale_ub=1200.0, 2025-05-07T20:32:10.2002606Z contiguous=False, 2025-05-07T20:32:10.2002834Z compiled=False, 2025-05-07T20:32:10.2003036Z ) 2025-05-07T20:32:10.2003358Z self = 2025-05-07T20:32:10.2003860Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:10.2004136Z 2025-05-07T20:32:10.2004267Z @given( 2025-05-07T20:32:10.2004494Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.2004812Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.2005118Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.2005453Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.2005789Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.2006073Z ) 2025-05-07T20:32:10.2006437Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.2006885Z def test_silu_mul_quant( 2025-05-07T20:32:10.2007134Z self, 2025-05-07T20:32:10.2007371Z T: int, 2025-05-07T20:32:10.2007571Z D: int, 2025-05-07T20:32:10.2014822Z scale_ub: Optional[float], 2025-05-07T20:32:10.2015108Z contiguous: bool, 2025-05-07T20:32:10.2015354Z compiled: bool, 2025-05-07T20:32:10.2015584Z ) -> None: 2025-05-07T20:32:10.2015795Z torch.manual_seed(2025) 2025-05-07T20:32:10.2016044Z 2025-05-07T20:32:10.2016325Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.2016665Z 2025-05-07T20:32:10.2016856Z x_sign = torch.sign(x) 2025-05-07T20:32:10.2017149Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.2017461Z x = x_sign * x_clamp 2025-05-07T20:32:10.2017694Z x0 = x[:, :D] 2025-05-07T20:32:10.2017906Z x1 = x[:, D:] 2025-05-07T20:32:10.2018171Z 2025-05-07T20:32:10.2018354Z if contiguous: 2025-05-07T20:32:10.2018590Z x0 = x0.contiguous() 2025-05-07T20:32:10.2018849Z x1 = x1.contiguous() 2025-05-07T20:32:10.2019083Z 2025-05-07T20:32:10.2019272Z if scale_ub is not None: 2025-05-07T20:32:10.2019542Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.2019875Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.2020182Z ) 2025-05-07T20:32:10.2020374Z else: 2025-05-07T20:32:10.2020577Z scale_ub_tensor = None 2025-05-07T20:32:10.2020826Z 2025-05-07T20:32:10.2021060Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.2021367Z op = silu_mul_quant 2025-05-07T20:32:10.2021613Z if compiled: 2025-05-07T20:32:10.2021862Z op = torch.compile(op) 2025-05-07T20:32:10.2022185Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.2022477Z 2025-05-07T20:32:10.2022778Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.2022946Z 2025-05-07T20:32:10.2023050Z moe/activation_test.py:117: 2025-05-07T20:32:10.2023341Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.2023669Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.2023950Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.2024641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:10.2025346Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:10.2025889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:10.2026577Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:10.2027242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:10.2027784Z     kernel = self.compile(
2025-05-07T20:32:10.2028332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:10.2028999Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:10.2029394Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:10.2029837Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:10.2030986Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:10.2032382Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7faaebfe8ca0>}
2025-05-07T20:32:10.2033748Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:10.2034827Z context = <...>

2025-05-07T20:32:10.2035289Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:10.2035814Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:10.2036292Z                            module_map=module_map)
2025-05-07T20:32:10.2036656Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:10.2037009Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:10.2037264Z E       ^
2025-05-07T20:32:10.2037727Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.2038611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:10.2039237Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:10.3984905Z self = <...>
2025-05-07T20:32:10.3985595Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True

2025-05-07T20:32:10.3985979Z     @given(
2025-05-07T20:32:10.3986223Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:10.3986927Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:10.3987260Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:10.3987603Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:10.3987959Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:10.3988263Z     )
2025-05-07T20:32:10.3988627Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:10.3989094Z     def test_silu_mul_quant(
2025-05-07T20:32:10.3989351Z         self,
2025-05-07T20:32:10.3989559Z         T: int,
2025-05-07T20:32:10.3989768Z         D: int,
2025-05-07T20:32:10.3990001Z         scale_ub: Optional[float],
2025-05-07T20:32:10.3990280Z         contiguous: bool,
2025-05-07T20:32:10.3990536Z         compiled: bool,
2025-05-07T20:32:10.3990782Z     ) -> None:
2025-05-07T20:32:10.3991010Z         torch.manual_seed(2025)
2025-05-07T20:32:10.3991271Z
2025-05-07T20:32:10.3991634Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:10.3992009Z
2025-05-07T20:32:10.3992224Z         x_sign = torch.sign(x)
2025-05-07T20:32:10.3992526Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:10.3992857Z         x = x_sign * x_clamp
2025-05-07T20:32:10.3993118Z         x0 = x[:, :D]
2025-05-07T20:32:10.3993350Z         x1 = x[:, D:]
2025-05-07T20:32:10.3993566Z
2025-05-07T20:32:10.3993772Z         if contiguous:
2025-05-07T20:32:10.3994028Z             x0 = x0.contiguous()
2025-05-07T20:32:10.3994290Z             x1 = x1.contiguous()
2025-05-07T20:32:10.3994652Z
2025-05-07T20:32:10.3994864Z         if scale_ub is not None:
2025-05-07T20:32:10.3995144Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:10.3995498Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:10.3995817Z             )
2025-05-07T20:32:10.3996010Z         else:
2025-05-07T20:32:10.3996228Z             scale_ub_tensor = None
2025-05-07T20:32:10.3996487Z
2025-05-07T20:32:10.3996729Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:10.3997052Z             op = silu_mul_quant
2025-05-07T20:32:10.3997307Z             if compiled:
2025-05-07T20:32:10.3997643Z                 op = torch.compile(op)
2025-05-07T20:32:10.3997954Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:10.3998236Z
2025-05-07T20:32:10.3998436Z >       y_fp8, y_scale = fn()

2025-05-07T20:32:10.3998708Z moe/activation_test.py:117:
2025-05-07T20:32:10.3999012Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:10.3999354Z moe/activation_test.py:115: in fn
2025-05-07T20:32:10.3999643Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:10.4000227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:10.4000805Z     return fn(*args, **kwargs)
2025-05-07T20:32:10.4001487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:10.4002246Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:10.4002801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:10.4003499Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:10.4004178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:10.4004729Z     kernel = self.compile(
2025-05-07T20:32:10.4005297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:10.4005977Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:10.4013764Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:10.4014132Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:10.4014396Z E       ^
2025-05-07T20:32:10.4014881Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.4015822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
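Every example Hypothesis tries dies at the same spot: the _fbgemm_silu_mul_quant Triton kernel quantizes its output to fp8e4nv (Triton's name for float8_e4m3fn), and Triton can only lower that dtype on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). The A10G behind a linux.g5.4xlarge runner is SM 8.6, so compilation fails before any numerics run; the values of T, D, scale_ub, contiguous, and compiled never matter. A capability guard would let the job skip cleanly instead of grinding through doomed examples. A minimal sketch, assuming only standard PyTorch CUDA APIs; the helper, its placement, and the class name are illustrative, not the FBGEMM test suite's actual code:

import unittest

import torch


def _supports_fp8e4nv() -> bool:
    # fp8e4nv == float8_e4m3fn; Triton lowers it only on NVIDIA GPUs with
    # compute capability >= 8.9 (Ada/Hopper). The A10G here reports (8, 6).
    return (
        torch.cuda.is_available()
        and torch.cuda.get_device_capability() >= (8, 9)
    )


@unittest.skipUnless(
    _supports_fp8e4nv(),
    "fp8e4nv kernels require an SM 8.9+ GPU (e.g. L4, L40S, H100)",
)
class ActivationTests(unittest.TestCase):  # illustrative class name
    ...

Gating at the class level keeps Hypothesis from even generating examples on unsupported hardware, which is why the skip is cheaper than catching the CompilationError per example.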
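The FBGEMM wrapper is incidental to the failure: under the same assumption (a CUDA device below SM 8.9), any Triton kernel that casts to tl.float8e4nv should raise the identical ValueError when it is first compiled. A self-contained repro sketch; the kernel name, sizes, and launch shape are made up for illustration:

import torch
import triton
import triton.language as tl


@triton.jit
def _cast_fp8e4nv_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # On SM < 8.9 this cast is what triggers
    #   ValueError("type fp8e4nv not supported in this architecture. ...")
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
grid = (triton.cdiv(x.numel(), 256),)
_cast_fp8e4nv_kernel[grid](x, y, x.numel(), BLOCK=256)  # CompilationError on pre-SM 8.9 GPUs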
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.4015398Z 2025-05-07T20:32:10.4015822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.4016341Z 2025-05-07T20:32:10.4016459Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.4016880Z self=, 2025-05-07T20:32:10.4017295Z T=4096, 2025-05-07T20:32:10.4017492Z D=5120, 2025-05-07T20:32:10.4017692Z scale_ub=None, 2025-05-07T20:32:10.4017920Z contiguous=False, 2025-05-07T20:32:10.4018287Z compiled=True, 2025-05-07T20:32:10.4018503Z ) 2025-05-07T20:32:10.4018883Z self = 2025-05-07T20:32:10.4019388Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:10.4019663Z 2025-05-07T20:32:10.4019745Z @given( 2025-05-07T20:32:10.4019978Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.4020303Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.4020627Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.4020965Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.4021322Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.4021652Z ) 2025-05-07T20:32:10.4022013Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.4022457Z def test_silu_mul_quant( 2025-05-07T20:32:10.4022705Z self, 2025-05-07T20:32:10.4022907Z T: int, 2025-05-07T20:32:10.4023103Z D: int, 2025-05-07T20:32:10.4023348Z scale_ub: Optional[float], 2025-05-07T20:32:10.4023625Z contiguous: bool, 2025-05-07T20:32:10.4023880Z compiled: bool, 2025-05-07T20:32:10.4024114Z ) -> None: 2025-05-07T20:32:10.4024341Z torch.manual_seed(2025) 2025-05-07T20:32:10.4024584Z 2025-05-07T20:32:10.4024866Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.4025225Z 2025-05-07T20:32:10.4025421Z x_sign = torch.sign(x) 2025-05-07T20:32:10.4025721Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.4026037Z x = x_sign * x_clamp 2025-05-07T20:32:10.4026280Z x0 = x[:, :D] 2025-05-07T20:32:10.4026508Z x1 = x[:, D:] 2025-05-07T20:32:10.4026723Z 2025-05-07T20:32:10.4026911Z if contiguous: 2025-05-07T20:32:10.4027156Z x0 = x0.contiguous() 2025-05-07T20:32:10.4027508Z x1 = x1.contiguous() 2025-05-07T20:32:10.4027751Z 2025-05-07T20:32:10.4027953Z if scale_ub is not None: 2025-05-07T20:32:10.4028242Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.4028579Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.4028891Z ) 2025-05-07T20:32:10.4029092Z else: 2025-05-07T20:32:10.4029303Z scale_ub_tensor = None 2025-05-07T20:32:10.4029566Z 2025-05-07T20:32:10.4029815Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.4030136Z op = silu_mul_quant 2025-05-07T20:32:10.4030387Z if compiled: 2025-05-07T20:32:10.4030642Z op = torch.compile(op) 2025-05-07T20:32:10.4030950Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.4031225Z 2025-05-07T20:32:10.4031435Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.4031615Z 2025-05-07T20:32:10.4031751Z moe/activation_test.py:117: 2025-05-07T20:32:10.4032072Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.4032412Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.4032709Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.4033288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:10.4033856Z return fn(*args, **kwargs) 
2025-05-07T20:32:10.4034532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:10.4035297Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.4035843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.4036542Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.4037228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.4037781Z kernel = self.compile( 2025-05-07T20:32:10.4038331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.4039053Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.4039458Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.4039687Z 2025-05-07T20:32:10.4039903Z self = 2025-05-07T20:32:10.4041014Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.4042425Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaebfe9240>} 2025-05-07T20:32:10.4043810Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.4044864Z context = 2025-05-07T20:32:10.4045159Z 2025-05-07T20:32:10.4045333Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.4045874Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.4046366Z module_map=module_map) 2025-05-07T20:32:10.4046743Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.4047103Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.4047375Z E ^ 2025-05-07T20:32:10.4047964Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.4048425Z 2025-05-07T20:32:10.4048854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.4049387Z 2025-05-07T20:32:10.7474870Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.7475852Z self=, 2025-05-07T20:32:10.7476668Z T=4096, 2025-05-07T20:32:10.7477051Z D=5120, 2025-05-07T20:32:10.7477456Z scale_ub=1200.0, 2025-05-07T20:32:10.7477910Z contiguous=False, 2025-05-07T20:32:10.7478370Z compiled=False, 2025-05-07T20:32:10.7478797Z ) 2025-05-07T20:32:10.7479433Z self = 2025-05-07T20:32:10.7480446Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:10.7481014Z 2025-05-07T20:32:10.7481171Z @given( 2025-05-07T20:32:10.7481651Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.7482150Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.7482502Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.7482850Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.7483185Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.7483480Z ) 2025-05-07T20:32:10.7483843Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.7484293Z def test_silu_mul_quant( 2025-05-07T20:32:10.7484762Z self, 2025-05-07T20:32:10.7484968Z T: int, 2025-05-07T20:32:10.7485166Z D: int, 2025-05-07T20:32:10.7485395Z scale_ub: Optional[float], 2025-05-07T20:32:10.7485677Z contiguous: bool, 2025-05-07T20:32:10.7485928Z compiled: bool, 2025-05-07T20:32:10.7486158Z ) -> None: 2025-05-07T20:32:10.7486383Z torch.manual_seed(2025) 2025-05-07T20:32:10.7486633Z 2025-05-07T20:32:10.7486914Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.7487268Z 2025-05-07T20:32:10.7487471Z x_sign = torch.sign(x) 2025-05-07T20:32:10.7487853Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.7488177Z x = x_sign * x_clamp 2025-05-07T20:32:10.7488423Z x0 = x[:, :D] 2025-05-07T20:32:10.7488640Z x1 = x[:, D:] 2025-05-07T20:32:10.7488854Z 2025-05-07T20:32:10.7489048Z if contiguous: 2025-05-07T20:32:10.7489286Z x0 = x0.contiguous() 2025-05-07T20:32:10.7489556Z x1 = x1.contiguous() 2025-05-07T20:32:10.7489801Z 2025-05-07T20:32:10.7489997Z if scale_ub is not None: 2025-05-07T20:32:10.7490275Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.7490619Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.7490939Z ) 2025-05-07T20:32:10.7491133Z else: 2025-05-07T20:32:10.7491350Z scale_ub_tensor = None 2025-05-07T20:32:10.7491617Z 2025-05-07T20:32:10.7491855Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.7492176Z op = silu_mul_quant 2025-05-07T20:32:10.7492435Z if compiled: 2025-05-07T20:32:10.7492683Z op = torch.compile(op) 2025-05-07T20:32:10.7492994Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.7493271Z 2025-05-07T20:32:10.7493465Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.7493639Z 2025-05-07T20:32:10.7493745Z moe/activation_test.py:117: 2025-05-07T20:32:10.7494050Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.7494378Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.7494675Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.7495381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:10.7496230Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.7496782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.7497487Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.7498249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.7498793Z kernel = self.compile( 2025-05-07T20:32:10.7499347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.7500025Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.7500429Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.7500655Z 2025-05-07T20:32:10.7500871Z self = 2025-05-07T20:32:10.7502016Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.7503453Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaebfeacb0>} 2025-05-07T20:32:10.7504827Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.7505932Z context = 2025-05-07T20:32:10.7506228Z 2025-05-07T20:32:10.7506400Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.7506935Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.7507448Z module_map=module_map) 2025-05-07T20:32:10.7507825Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.7508183Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.7508494Z E ^ 2025-05-07T20:32:10.7508968Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.7509425Z 2025-05-07T20:32:10.7509854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.7510375Z 2025-05-07T20:32:10.7510484Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.7510909Z self=, 2025-05-07T20:32:10.7511318Z T=4096, 2025-05-07T20:32:10.7511517Z D=5120, 2025-05-07T20:32:10.7511712Z scale_ub=1200.0, 2025-05-07T20:32:10.7511953Z contiguous=False, 2025-05-07T20:32:10.7512212Z compiled=True, 2025-05-07T20:32:10.7512445Z ) 2025-05-07T20:32:10.7512776Z self = 2025-05-07T20:32:10.7513282Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:10.7513562Z 2025-05-07T20:32:10.7513640Z @given( 2025-05-07T20:32:10.7513876Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.7514197Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.7514506Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.7514848Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.7515188Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.7515479Z ) 2025-05-07T20:32:10.7515847Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.7516296Z def test_silu_mul_quant( 2025-05-07T20:32:10.7516549Z self, 2025-05-07T20:32:10.7516758Z T: int, 2025-05-07T20:32:10.7516959Z D: int, 2025-05-07T20:32:10.7517277Z scale_ub: Optional[float], 2025-05-07T20:32:10.7517563Z contiguous: bool, 2025-05-07T20:32:10.7517809Z compiled: bool, 2025-05-07T20:32:10.7528634Z ) -> None: 2025-05-07T20:32:10.7528882Z torch.manual_seed(2025) 2025-05-07T20:32:10.7529139Z 2025-05-07T20:32:10.7529418Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.7529756Z 2025-05-07T20:32:10.7529957Z x_sign = torch.sign(x) 2025-05-07T20:32:10.7530260Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.7530574Z x = x_sign * x_clamp 2025-05-07T20:32:10.7530824Z x0 = x[:, :D] 2025-05-07T20:32:10.7531047Z x1 = x[:, D:] 2025-05-07T20:32:10.7531254Z 2025-05-07T20:32:10.7531450Z if contiguous: 2025-05-07T20:32:10.7531717Z x0 = x0.contiguous() 2025-05-07T20:32:10.7532013Z x1 = x1.contiguous() 2025-05-07T20:32:10.7532254Z 2025-05-07T20:32:10.7532461Z if scale_ub is not None: 2025-05-07T20:32:10.7532746Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.7533085Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.7533405Z ) 2025-05-07T20:32:10.7533608Z else: 2025-05-07T20:32:10.7533825Z scale_ub_tensor = None 2025-05-07T20:32:10.7534089Z 2025-05-07T20:32:10.7534334Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.7534654Z op = silu_mul_quant 2025-05-07T20:32:10.7535000Z if compiled: 2025-05-07T20:32:10.7535257Z op = torch.compile(op) 2025-05-07T20:32:10.7535557Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.7535840Z 2025-05-07T20:32:10.7536046Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.7536215Z 2025-05-07T20:32:10.7536319Z moe/activation_test.py:117: 2025-05-07T20:32:10.7536628Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.7536974Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.7537268Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.7537841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:10.7538570Z return fn(*args, **kwargs) 
2025-05-07T20:32:10.7539247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:10.7539941Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.7540493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.7541186Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.7541894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.7542453Z kernel = self.compile( 2025-05-07T20:32:10.7543016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.7543697Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.7544110Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.7544341Z 2025-05-07T20:32:10.7544562Z self = 2025-05-07T20:32:10.7545672Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.7547089Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaebfeab90>} 2025-05-07T20:32:10.7548547Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.7549597Z context = 2025-05-07T20:32:10.7549900Z 2025-05-07T20:32:10.7550072Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.7550608Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.7551097Z module_map=module_map) 2025-05-07T20:32:10.7551468Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.7551834Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.7552103Z E ^ 2025-05-07T20:32:10.7552576Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.7553042Z 2025-05-07T20:32:10.7553470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.7553993Z 2025-05-07T20:32:10.8823837Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.8824294Z self=, 2025-05-07T20:32:10.8824722Z T=2048, 2025-05-07T20:32:10.8824920Z D=7168, 2025-05-07T20:32:10.8825124Z scale_ub=1200.0, 2025-05-07T20:32:10.8825405Z contiguous=False, 2025-05-07T20:32:10.8825633Z compiled=False, 2025-05-07T20:32:10.8825966Z ) 2025-05-07T20:32:10.8826298Z self = 2025-05-07T20:32:10.8826811Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:10.8827103Z 2025-05-07T20:32:10.8827184Z @given( 2025-05-07T20:32:10.8827425Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.8827741Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.8828067Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.8828411Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.8828753Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.8829123Z ) 2025-05-07T20:32:10.8829487Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.8829942Z def test_silu_mul_quant( 2025-05-07T20:32:10.8830186Z self, 2025-05-07T20:32:10.8830381Z T: int, 2025-05-07T20:32:10.8830583Z D: int, 2025-05-07T20:32:10.8830804Z scale_ub: Optional[float], 2025-05-07T20:32:10.8831082Z contiguous: bool, 2025-05-07T20:32:10.8831325Z compiled: bool, 2025-05-07T20:32:10.8831553Z ) -> None: 2025-05-07T20:32:10.8831781Z torch.manual_seed(2025) 2025-05-07T20:32:10.8832032Z 2025-05-07T20:32:10.8832318Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.8832676Z 2025-05-07T20:32:10.8832890Z x_sign = torch.sign(x) 2025-05-07T20:32:10.8833195Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.8833511Z x = x_sign * x_clamp 2025-05-07T20:32:10.8833765Z x0 = x[:, :D] 2025-05-07T20:32:10.8833988Z x1 = x[:, D:] 2025-05-07T20:32:10.8834199Z 2025-05-07T20:32:10.8834394Z if contiguous: 2025-05-07T20:32:10.8834638Z x0 = x0.contiguous() 2025-05-07T20:32:10.8834901Z x1 = x1.contiguous() 2025-05-07T20:32:10.8835154Z 2025-05-07T20:32:10.8835357Z if scale_ub is not None: 2025-05-07T20:32:10.8835632Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.8835983Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.8836303Z ) 2025-05-07T20:32:10.8836499Z else: 2025-05-07T20:32:10.8836729Z scale_ub_tensor = None 2025-05-07T20:32:10.8836993Z 2025-05-07T20:32:10.8837232Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.8837675Z op = silu_mul_quant 2025-05-07T20:32:10.8837936Z if compiled: 2025-05-07T20:32:10.8838189Z op = torch.compile(op) 2025-05-07T20:32:10.8838500Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.8838781Z 2025-05-07T20:32:10.8838985Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.8839157Z 2025-05-07T20:32:10.8839260Z moe/activation_test.py:117: 2025-05-07T20:32:10.8839563Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.8839901Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.8840188Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.8840903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:10.8841614Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.8842177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.8842872Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.8843557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.8844109Z kernel = self.compile( 2025-05-07T20:32:10.8844665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.8845344Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.8845822Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.8846052Z 2025-05-07T20:32:10.8846273Z self = 2025-05-07T20:32:10.8847384Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.8848801Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb9c05e0>} 2025-05-07T20:32:10.8850227Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.8851288Z context = 2025-05-07T20:32:10.8851584Z 2025-05-07T20:32:10.8851762Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.8852292Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.8852778Z module_map=module_map) 2025-05-07T20:32:10.8853159Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.8853517Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.8853790Z E ^ 2025-05-07T20:32:10.8854277Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.8854739Z 2025-05-07T20:32:10.8855174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.8856098Z 2025-05-07T20:32:10.8856215Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.8856659Z self=, 2025-05-07T20:32:10.8857077Z T=1, 2025-05-07T20:32:10.8857268Z D=7168, 2025-05-07T20:32:10.8857477Z scale_ub=None, 2025-05-07T20:32:10.8857708Z contiguous=True, 2025-05-07T20:32:10.8857936Z compiled=False, 2025-05-07T20:32:10.8858223Z ) 2025-05-07T20:32:10.8858554Z self = 2025-05-07T20:32:10.8859201Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:10.8859474Z 2025-05-07T20:32:10.8859554Z @given( 2025-05-07T20:32:10.8859801Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.8860126Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.8860441Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.8860786Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.8861128Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.8861421Z ) 2025-05-07T20:32:10.8861789Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.8862246Z def test_silu_mul_quant( 2025-05-07T20:32:10.8862493Z self, 2025-05-07T20:32:10.8862703Z T: int, 2025-05-07T20:32:10.8862913Z D: int, 2025-05-07T20:32:10.8863142Z scale_ub: Optional[float], 2025-05-07T20:32:10.8863421Z contiguous: bool, 2025-05-07T20:32:10.8863676Z compiled: bool, 2025-05-07T20:32:10.8863914Z ) -> None: 2025-05-07T20:32:10.8864136Z torch.manual_seed(2025) 2025-05-07T20:32:10.8864389Z 2025-05-07T20:32:10.8864678Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.8865029Z 2025-05-07T20:32:10.8865236Z x_sign = torch.sign(x) 2025-05-07T20:32:10.8865536Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.8865849Z x = x_sign * x_clamp 2025-05-07T20:32:10.8866164Z x0 = x[:, :D] 2025-05-07T20:32:10.8866391Z x1 = x[:, D:] 2025-05-07T20:32:10.8866603Z 2025-05-07T20:32:10.8866799Z if contiguous: 2025-05-07T20:32:10.8867043Z x0 = x0.contiguous() 2025-05-07T20:32:10.8867309Z x1 = x1.contiguous() 2025-05-07T20:32:10.8867561Z 2025-05-07T20:32:10.8867764Z if scale_ub is not None: 2025-05-07T20:32:10.8868053Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.8868407Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.8868726Z ) 2025-05-07T20:32:10.8868930Z else: 2025-05-07T20:32:10.8869227Z scale_ub_tensor = None 2025-05-07T20:32:10.8869492Z 2025-05-07T20:32:10.8869740Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.8870065Z op = silu_mul_quant 2025-05-07T20:32:10.8870328Z if compiled: 2025-05-07T20:32:10.8870585Z op = torch.compile(op) 2025-05-07T20:32:10.8870899Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.8871188Z 2025-05-07T20:32:10.8871388Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.8871569Z 2025-05-07T20:32:10.8871687Z moe/activation_test.py:117: 2025-05-07T20:32:10.8872031Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.8872366Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.8872662Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.8873380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:10.8874093Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.8874645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.8875348Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.8876036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.8876583Z kernel = self.compile( 2025-05-07T20:32:10.8877147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.8877825Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.8878234Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.8878617Z 2025-05-07T20:32:10.8878838Z self = 2025-05-07T20:32:10.8879956Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.8881371Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb9c0d30>} 2025-05-07T20:32:10.8882759Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.8883820Z context = 2025-05-07T20:32:10.8884119Z 2025-05-07T20:32:10.8884296Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.8884842Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.8885331Z module_map=module_map) 2025-05-07T20:32:10.8885703Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.8886074Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.8886343Z E ^ 2025-05-07T20:32:10.8886819Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.8887333Z 2025-05-07T20:32:10.8887764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.8888296Z 2025-05-07T20:32:10.8888405Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.8888835Z self=, 2025-05-07T20:32:10.8889245Z T=16384, 2025-05-07T20:32:10.8889455Z D=7168, 2025-05-07T20:32:10.8889659Z scale_ub=1200.0, 2025-05-07T20:32:10.8889890Z contiguous=False, 2025-05-07T20:32:10.8890176Z compiled=True, 2025-05-07T20:32:11.1538371Z ) 2025-05-07T20:32:11.1539477Z self = 2025-05-07T20:32:11.1540925Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:11.1541526Z 2025-05-07T20:32:11.1541610Z @given( 2025-05-07T20:32:11.1541866Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.1542193Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.1542508Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.1542854Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.1543197Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.1543486Z ) 2025-05-07T20:32:11.1543856Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.1544312Z def test_silu_mul_quant( 2025-05-07T20:32:11.1544564Z self, 2025-05-07T20:32:11.1544761Z T: int, 2025-05-07T20:32:11.1544970Z D: int, 2025-05-07T20:32:11.1545201Z scale_ub: Optional[float], 2025-05-07T20:32:11.1545474Z contiguous: bool, 2025-05-07T20:32:11.1545720Z compiled: bool, 2025-05-07T20:32:11.1545953Z ) -> None: 2025-05-07T20:32:11.1546176Z torch.manual_seed(2025) 2025-05-07T20:32:11.1546427Z 2025-05-07T20:32:11.1546716Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.1547063Z 2025-05-07T20:32:11.1547272Z x_sign = torch.sign(x) 2025-05-07T20:32:11.1547573Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.1547884Z x = x_sign * x_clamp 2025-05-07T20:32:11.1548132Z x0 = x[:, :D] 2025-05-07T20:32:11.1548354Z x1 = x[:, D:] 2025-05-07T20:32:11.1548563Z 2025-05-07T20:32:11.1548957Z if contiguous: 2025-05-07T20:32:11.1549204Z x0 = x0.contiguous() 2025-05-07T20:32:11.1549466Z x1 = x1.contiguous() 2025-05-07T20:32:11.1549718Z 2025-05-07T20:32:11.1549918Z if scale_ub is not None: 2025-05-07T20:32:11.1550203Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.1550543Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.1550856Z ) 2025-05-07T20:32:11.1551055Z else: 2025-05-07T20:32:11.1551268Z scale_ub_tensor = None 2025-05-07T20:32:11.1551526Z 2025-05-07T20:32:11.1551774Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.1552090Z op = silu_mul_quant 2025-05-07T20:32:11.1552347Z if compiled: 2025-05-07T20:32:11.1552605Z op = torch.compile(op) 2025-05-07T20:32:11.1552911Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.1553196Z 2025-05-07T20:32:11.1553407Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.1553579Z 2025-05-07T20:32:11.1553687Z moe/activation_test.py:117: 2025-05-07T20:32:11.1553992Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.1554330Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.1554623Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.1555200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.1555970Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.1556722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.1557423Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.1557975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.1558673Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.1559358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.1559965Z kernel = self.compile( 2025-05-07T20:32:11.1560523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.1561199Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.1561599Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.1561838Z 2025-05-07T20:32:11.1562053Z self = 2025-05-07T20:32:11.1563163Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.1564585Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb9c1bd0>} 2025-05-07T20:32:11.1565962Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.1567012Z context = 2025-05-07T20:32:11.1567311Z 2025-05-07T20:32:11.1567487Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.1568023Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.1568507Z module_map=module_map) 2025-05-07T20:32:11.1568878Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.1569241Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.1569507Z E ^ 2025-05-07T20:32:11.1570094Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.1570562Z 2025-05-07T20:32:11.1570989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.1571518Z 2025-05-07T20:32:11.1571626Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.1572051Z self=, 2025-05-07T20:32:11.1572460Z T=1, 2025-05-07T20:32:11.1572654Z D=7168, 2025-05-07T20:32:11.1572855Z scale_ub=None, 2025-05-07T20:32:11.1573075Z contiguous=False, 2025-05-07T20:32:11.1573308Z compiled=False, 2025-05-07T20:32:11.1573522Z ) 2025-05-07T20:32:11.1573846Z self = 2025-05-07T20:32:11.1574346Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:11.1574618Z 2025-05-07T20:32:11.1574706Z @given( 2025-05-07T20:32:11.1574947Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.1575263Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.1575585Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.1575925Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.1576257Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.1576550Z ) 2025-05-07T20:32:11.1576910Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.1577403Z def test_silu_mul_quant( 2025-05-07T20:32:11.1577651Z self, 2025-05-07T20:32:11.1577856Z T: int, 2025-05-07T20:32:11.1578120Z D: int, 2025-05-07T20:32:11.1578353Z scale_ub: Optional[float], 2025-05-07T20:32:11.1578633Z contiguous: bool, 2025-05-07T20:32:11.1578886Z compiled: bool, 2025-05-07T20:32:11.1579131Z ) -> None: 2025-05-07T20:32:11.1579366Z torch.manual_seed(2025) 2025-05-07T20:32:11.1579615Z 2025-05-07T20:32:11.1579900Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.1580297Z 2025-05-07T20:32:11.1580499Z x_sign = torch.sign(x) 2025-05-07T20:32:11.1580801Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.1581115Z x = x_sign * x_clamp 2025-05-07T20:32:11.1581362Z x0 = x[:, :D] 2025-05-07T20:32:11.1581586Z x1 = x[:, D:] 2025-05-07T20:32:11.1581793Z 2025-05-07T20:32:11.1581989Z if contiguous: 2025-05-07T20:32:11.1582230Z x0 = x0.contiguous() 2025-05-07T20:32:11.1582488Z x1 = x1.contiguous() 2025-05-07T20:32:11.1582740Z 2025-05-07T20:32:11.1582940Z if scale_ub is not None: 2025-05-07T20:32:11.1583214Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.1583559Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.1583874Z ) 2025-05-07T20:32:11.1584071Z else: 2025-05-07T20:32:11.1584290Z scale_ub_tensor = None 2025-05-07T20:32:11.1584548Z 2025-05-07T20:32:11.1584789Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.1585108Z op = silu_mul_quant 2025-05-07T20:32:11.1585363Z if compiled: 2025-05-07T20:32:11.1585620Z op = torch.compile(op) 2025-05-07T20:32:11.1585919Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.1586199Z 2025-05-07T20:32:11.1586400Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.1586571Z 2025-05-07T20:32:11.1586673Z moe/activation_test.py:117: 2025-05-07T20:32:11.1586975Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.1587308Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.1587594Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.1588383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.1589088Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.1589634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.1590326Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.1591003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.1591546Z kernel = self.compile( 2025-05-07T20:32:11.1592103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.1592766Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.1600632Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.1600902Z 2025-05-07T20:32:11.1601133Z self = 2025-05-07T20:32:11.1602297Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.1603701Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb9c2050>} 2025-05-07T20:32:11.1605071Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.1606189Z context = 2025-05-07T20:32:11.1606482Z 2025-05-07T20:32:11.1606656Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.1607187Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.1607662Z module_map=module_map) 2025-05-07T20:32:11.1608033Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.1608436Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.1608704Z E ^ 2025-05-07T20:32:11.1609178Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.1609634Z 2025-05-07T20:32:11.1610065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.1610586Z 2025-05-07T20:32:11.1610692Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.1611114Z self=, 2025-05-07T20:32:11.1611515Z T=2048, 2025-05-07T20:32:11.1611703Z D=7168, 2025-05-07T20:32:11.1611899Z scale_ub=None, 2025-05-07T20:32:11.1612129Z contiguous=False, 2025-05-07T20:32:11.1612397Z compiled=True, 2025-05-07T20:32:11.1612610Z ) 2025-05-07T20:32:11.2607085Z self = 2025-05-07T20:32:11.2608613Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:11.2609390Z 2025-05-07T20:32:11.2609608Z @given( 2025-05-07T20:32:11.2610266Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.2611114Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.2611712Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.2612052Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.2612384Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.2612676Z ) 2025-05-07T20:32:11.2613034Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.2613484Z def test_silu_mul_quant( 2025-05-07T20:32:11.2613729Z self, 2025-05-07T20:32:11.2614098Z T: int, 2025-05-07T20:32:11.2614306Z D: int, 2025-05-07T20:32:11.2614524Z scale_ub: Optional[float], 2025-05-07T20:32:11.2614808Z contiguous: bool, 2025-05-07T20:32:11.2615054Z compiled: bool, 2025-05-07T20:32:11.2615280Z ) -> None: 2025-05-07T20:32:11.2615507Z torch.manual_seed(2025) 2025-05-07T20:32:11.2615754Z 2025-05-07T20:32:11.2616035Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.2616386Z 2025-05-07T20:32:11.2616589Z x_sign = torch.sign(x) 2025-05-07T20:32:11.2616884Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.2617201Z x = x_sign * x_clamp 2025-05-07T20:32:11.2617448Z x0 = x[:, :D] 2025-05-07T20:32:11.2617666Z x1 = x[:, D:] 2025-05-07T20:32:11.2617876Z 2025-05-07T20:32:11.2618142Z if contiguous: 2025-05-07T20:32:11.2618377Z x0 = x0.contiguous() 2025-05-07T20:32:11.2618646Z x1 = x1.contiguous() 2025-05-07T20:32:11.2618887Z 2025-05-07T20:32:11.2619083Z if scale_ub is not None: 2025-05-07T20:32:11.2619360Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.2619698Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.2620009Z ) 2025-05-07T20:32:11.2620204Z else: 2025-05-07T20:32:11.2620412Z scale_ub_tensor = None 2025-05-07T20:32:11.2620669Z 2025-05-07T20:32:11.2620911Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.2621293Z op = silu_mul_quant 2025-05-07T20:32:11.2621549Z if compiled: 2025-05-07T20:32:11.2621804Z op = torch.compile(op) 2025-05-07T20:32:11.2622101Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.2622382Z 2025-05-07T20:32:11.2622582Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.2622746Z 2025-05-07T20:32:11.2622846Z moe/activation_test.py:117: 2025-05-07T20:32:11.2623144Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.2623477Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.2623771Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.2624409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.2624977Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.2625642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.2626338Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.2626880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.2627566Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.2628239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.2628775Z kernel = self.compile( 2025-05-07T20:32:11.2629333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.2630002Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.2630392Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.2630622Z 2025-05-07T20:32:11.2630835Z self = 2025-05-07T20:32:11.2631938Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.2633334Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb9c31c0>} 2025-05-07T20:32:11.2634777Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.2635818Z context = 2025-05-07T20:32:11.2636113Z 2025-05-07T20:32:11.2636283Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.2636811Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.2637287Z module_map=module_map) 2025-05-07T20:32:11.2637651Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.2638013Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.2638273Z E ^ 2025-05-07T20:32:11.2638739Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.2639204Z 2025-05-07T20:32:11.2639625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.2640150Z 2025-05-07T20:32:11.2640255Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.2640675Z self=, 2025-05-07T20:32:11.2641074Z T=4096, 2025-05-07T20:32:11.2641268Z D=7168, 2025-05-07T20:32:11.2641464Z scale_ub=None, 2025-05-07T20:32:11.2641725Z contiguous=False, 2025-05-07T20:32:11.2641953Z compiled=True, 2025-05-07T20:32:11.2642158Z ) 2025-05-07T20:32:11.2642480Z self = 2025-05-07T20:32:11.2642978Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:11.2643256Z 2025-05-07T20:32:11.2643332Z @given( 2025-05-07T20:32:11.2643562Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.2643882Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.2644189Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.2644523Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.2644900Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.2645188Z ) 2025-05-07T20:32:11.2645543Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.2645983Z def test_silu_mul_quant( 2025-05-07T20:32:11.2646229Z self, 2025-05-07T20:32:11.2646428Z T: int, 2025-05-07T20:32:11.2646624Z D: int, 2025-05-07T20:32:11.2646848Z scale_ub: Optional[float], 2025-05-07T20:32:11.2647122Z contiguous: bool, 2025-05-07T20:32:11.2647368Z compiled: bool, 2025-05-07T20:32:11.2647588Z ) -> None: 2025-05-07T20:32:11.2647808Z torch.manual_seed(2025) 2025-05-07T20:32:11.2648052Z 2025-05-07T20:32:11.2648329Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.2648674Z 2025-05-07T20:32:11.2648870Z x_sign = torch.sign(x) 2025-05-07T20:32:11.2649160Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.2649477Z x = x_sign * x_clamp 2025-05-07T20:32:11.2649720Z x0 = x[:, :D] 2025-05-07T20:32:11.2649934Z x1 = x[:, D:] 2025-05-07T20:32:11.2650145Z 2025-05-07T20:32:11.2650334Z if contiguous: 2025-05-07T20:32:11.2650564Z x0 = x0.contiguous() 2025-05-07T20:32:11.2650825Z x1 = x1.contiguous() 2025-05-07T20:32:11.2651068Z 2025-05-07T20:32:11.2651256Z if scale_ub is not None: 2025-05-07T20:32:11.2651536Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.2651874Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.2652181Z ) 2025-05-07T20:32:11.2652371Z else: 2025-05-07T20:32:11.2652585Z scale_ub_tensor = None 2025-05-07T20:32:11.2652840Z 2025-05-07T20:32:11.2653153Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.2653470Z op = silu_mul_quant 2025-05-07T20:32:11.2653723Z if compiled: 2025-05-07T20:32:11.2653972Z op = torch.compile(op) 2025-05-07T20:32:11.2654271Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.2654547Z 2025-05-07T20:32:11.2654739Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.2654911Z 2025-05-07T20:32:11.2655011Z moe/activation_test.py:117: 2025-05-07T20:32:11.2655310Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.2655814Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.2656099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.2656666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.2657230Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.2657895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.2658649Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.2659200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.2659887Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.2660559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.2661168Z kernel = self.compile( 2025-05-07T20:32:11.2661720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.2662382Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.2662781Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.2663006Z 2025-05-07T20:32:11.2663232Z self = 2025-05-07T20:32:11.2664328Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.2665788Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb6bc1f0>} 2025-05-07T20:32:11.2667159Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.2668201Z context = 2025-05-07T20:32:11.2668493Z 2025-05-07T20:32:11.2668670Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.2669202Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.2669683Z module_map=module_map) 2025-05-07T20:32:11.2670052Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.2670414Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.2670669Z E ^ 2025-05-07T20:32:11.2671138Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self =
T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb6bc700>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
E       The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
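Every one of these draws fails at the same compilation step, before any test parameter matters: fp8e4nv is Triton's name for the float8_e4m3fn format, which NVIDIA GPUs support natively only from compute capability 8.9 (Ada/Hopper) onward. The A10G that backs this runner class reports SM 8.6, where Triton offers only fp8e4b15 and fp8e5, hence the ValueError above. Below is a minimal sketch of a capability guard a test suite could use to skip such cases on unsupported GPUs; the helper name, the class name, and the (8, 9) threshold are illustrative assumptions, not FBGEMM code.

```python
import unittest

import torch


def gpu_supports_fp8e4nv() -> bool:
    # Assumption: Triton lowers fp8e4nv (torch.float8_e4m3fn) natively only on
    # SM 8.9+ (Ada, Hopper). An A10G reports (8, 6) and would be skipped.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


class ActivationFP8GuardExample(unittest.TestCase):  # hypothetical test class
    @unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    def test_silu_mul_quant_guarded(self) -> None:
        ...  # the property-based body above would run here unchanged
```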
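For checking numerics on supported hardware, the following eager-mode sketch shows what the test appears to exercise, inferred only from the test body above: y = SiLU(x0) * x1, quantized to FP8 with an optional upper bound on the scale. The rowwise granularity, the helper name silu_mul_quant_ref, and the 448.0 constant (the finite maximum of float8_e4m3fn) are assumptions, not FBGEMM's actual implementation.

```python
from typing import Optional, Tuple

import torch

FP8_E4M3_MAX = 448.0  # finite max of torch.float8_e4m3fn (Triton's fp8e4nv)


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Sketch: SiLU(x0) * x1 in float32, then rowwise FP8 quantization.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        # Optional clamp on the rowwise max, mirroring the scale_ub_tensor
        # argument in the test above.
        row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
    y_scale = row_max / FP8_E4M3_MAX
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale
```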
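To confirm the architecture check in isolation, a self-contained kernel like the sketch below should raise the same CompilationError on an SM 8.6 device and compile cleanly on SM 8.9+. It uses only public Triton APIs (tl.float8e4nv, tl.load/tl.store) and is illustrative, not the FBGEMM kernel.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _fp8_cast_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # The cast to fp8e4nv is what trips the check on pre-SM89 GPUs.
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


n = 1024
x = torch.randn(n, device="cuda", dtype=torch.float32)
y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
_fp8_cast_kernel[(triton.cdiv(n, 256),)](x, y, n, BLOCK=256)
```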
2025-05-07T20:32:11.6285950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6286653Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6287191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6287924Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6288601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6289137Z kernel = self.compile( 2025-05-07T20:32:11.6289684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6290351Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6290749Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6290976Z 2025-05-07T20:32:11.6291193Z self = 2025-05-07T20:32:11.6292287Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6293687Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb6bd7e0>} 2025-05-07T20:32:11.6295054Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6296100Z context = 2025-05-07T20:32:11.6296392Z 2025-05-07T20:32:11.6296561Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6297084Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6297639Z module_map=module_map) 2025-05-07T20:32:11.6298083Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6298443Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6298707Z E ^ 2025-05-07T20:32:11.6299182Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6299633Z 2025-05-07T20:32:11.6300060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6300580Z 2025-05-07T20:32:11.8197526Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.8198134Z self=, 2025-05-07T20:32:11.8198758Z T=16384, 2025-05-07T20:32:11.8199029Z D=5120, 2025-05-07T20:32:11.8199303Z scale_ub=None, 2025-05-07T20:32:11.8199618Z contiguous=False, 2025-05-07T20:32:11.8199851Z compiled=True, 2025-05-07T20:32:11.8200067Z ) 2025-05-07T20:32:11.8200395Z self = 2025-05-07T20:32:11.8200901Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:11.8201189Z 2025-05-07T20:32:11.8201269Z @given( 2025-05-07T20:32:11.8201506Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.8201819Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.8202132Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.8202594Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.8202937Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.8203233Z ) 2025-05-07T20:32:11.8203591Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.8210370Z def test_silu_mul_quant( 2025-05-07T20:32:11.8210638Z self, 2025-05-07T20:32:11.8210843Z T: int, 2025-05-07T20:32:11.8211052Z D: int, 2025-05-07T20:32:11.8211270Z scale_ub: Optional[float], 2025-05-07T20:32:11.8211549Z contiguous: bool, 2025-05-07T20:32:11.8211896Z compiled: bool, 2025-05-07T20:32:11.8212123Z ) -> None: 2025-05-07T20:32:11.8212345Z torch.manual_seed(2025) 2025-05-07T20:32:11.8212584Z 2025-05-07T20:32:11.8212862Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.8213213Z 2025-05-07T20:32:11.8213413Z x_sign = torch.sign(x) 2025-05-07T20:32:11.8213715Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.8214031Z x = x_sign * x_clamp 2025-05-07T20:32:11.8214271Z x0 = x[:, :D] 2025-05-07T20:32:11.8214491Z x1 = x[:, D:] 2025-05-07T20:32:11.8214701Z 2025-05-07T20:32:11.8214883Z if contiguous: 2025-05-07T20:32:11.8215121Z x0 = x0.contiguous() 2025-05-07T20:32:11.8215385Z x1 = x1.contiguous() 2025-05-07T20:32:11.8215631Z 2025-05-07T20:32:11.8215827Z if scale_ub is not None: 2025-05-07T20:32:11.8216106Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.8216455Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.8216767Z ) 2025-05-07T20:32:11.8216967Z else: 2025-05-07T20:32:11.8217184Z scale_ub_tensor = None 2025-05-07T20:32:11.8217435Z 2025-05-07T20:32:11.8217671Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.8218072Z op = silu_mul_quant 2025-05-07T20:32:11.8218326Z if compiled: 2025-05-07T20:32:11.8218582Z op = torch.compile(op) 2025-05-07T20:32:11.8218886Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.8219157Z 2025-05-07T20:32:11.8219357Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.8219524Z 2025-05-07T20:32:11.8219629Z moe/activation_test.py:117: 2025-05-07T20:32:11.8220054Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.8220381Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.8220668Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.8221246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.8221811Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.8222482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.8223182Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.8223721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.8224403Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.8225077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.8225616Z kernel = self.compile( 2025-05-07T20:32:11.8226164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.8226829Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.8227228Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.8227454Z 2025-05-07T20:32:11.8227672Z self = 2025-05-07T20:32:11.8228763Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.8230209Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb6be680>} 2025-05-07T20:32:11.8231581Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.8232660Z context = 2025-05-07T20:32:11.8232951Z 2025-05-07T20:32:11.8233122Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.8233644Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.8234122Z module_map=module_map) 2025-05-07T20:32:11.8234491Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.8234848Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.8235103Z E ^ 2025-05-07T20:32:11.8235576Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.8236030Z 2025-05-07T20:32:11.8236462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.8236982Z 2025-05-07T20:32:11.8237090Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.8237506Z self=, 2025-05-07T20:32:11.8237909Z T=2048, 2025-05-07T20:32:11.8238101Z D=5120, 2025-05-07T20:32:11.8238290Z scale_ub=None, 2025-05-07T20:32:11.8238508Z contiguous=False, 2025-05-07T20:32:11.8238740Z compiled=True, 2025-05-07T20:32:11.8238940Z ) 2025-05-07T20:32:11.9275432Z self = 2025-05-07T20:32:11.9276159Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:11.9276556Z 2025-05-07T20:32:11.9276672Z @given( 2025-05-07T20:32:11.9276993Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.9277440Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.9278059Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.9278491Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.9278835Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.9279127Z ) 2025-05-07T20:32:11.9279481Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.9279932Z def test_silu_mul_quant( 2025-05-07T20:32:11.9280212Z self, 2025-05-07T20:32:11.9280418Z T: int, 2025-05-07T20:32:11.9280624Z D: int, 2025-05-07T20:32:11.9280846Z scale_ub: Optional[float], 2025-05-07T20:32:11.9281125Z contiguous: bool, 2025-05-07T20:32:11.9281372Z compiled: bool, 2025-05-07T20:32:11.9281603Z ) -> None: 2025-05-07T20:32:11.9281828Z torch.manual_seed(2025) 2025-05-07T20:32:11.9282075Z 2025-05-07T20:32:11.9282351Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.9282697Z 2025-05-07T20:32:11.9282901Z x_sign = torch.sign(x) 2025-05-07T20:32:11.9283194Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.9283511Z x = x_sign * x_clamp 2025-05-07T20:32:11.9283758Z x0 = x[:, :D] 2025-05-07T20:32:11.9283976Z x1 = x[:, D:] 2025-05-07T20:32:11.9284192Z 2025-05-07T20:32:11.9284387Z if contiguous: 2025-05-07T20:32:11.9284622Z x0 = x0.contiguous() 2025-05-07T20:32:11.9284888Z x1 = x1.contiguous() 2025-05-07T20:32:11.9285203Z 2025-05-07T20:32:11.9285405Z if scale_ub is not None: 2025-05-07T20:32:11.9285681Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.9286026Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.9286341Z ) 2025-05-07T20:32:11.9286534Z else: 2025-05-07T20:32:11.9286749Z scale_ub_tensor = None 2025-05-07T20:32:11.9287005Z 2025-05-07T20:32:11.9287244Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.9287566Z op = silu_mul_quant 2025-05-07T20:32:11.9287822Z if compiled: 2025-05-07T20:32:11.9288069Z op = torch.compile(op) 2025-05-07T20:32:11.9288443Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.9288723Z 2025-05-07T20:32:11.9288916Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.9289089Z 2025-05-07T20:32:11.9289192Z moe/activation_test.py:117: 2025-05-07T20:32:11.9289499Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.9289840Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.9290126Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.9290694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.9291263Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.9291934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.9292694Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.9293237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.9293929Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.9294598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.9295143Z kernel = self.compile( 2025-05-07T20:32:11.9295700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.9296367Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.9296771Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.9297004Z 2025-05-07T20:32:11.9297216Z self = 2025-05-07T20:32:11.9298501Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.9299917Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb6be560>} 2025-05-07T20:32:11.9301279Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.9302329Z context = 2025-05-07T20:32:11.9302629Z 2025-05-07T20:32:11.9302797Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.9303332Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.9303804Z module_map=module_map) 2025-05-07T20:32:11.9304181Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.9304543Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.9304805Z E ^ 2025-05-07T20:32:11.9305279Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.9305739Z 2025-05-07T20:32:11.9306208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.9306727Z 2025-05-07T20:32:11.9306840Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.9307258Z self=, 2025-05-07T20:32:11.9307666Z T=2048, 2025-05-07T20:32:11.9307860Z D=5120, 2025-05-07T20:32:11.9308052Z scale_ub=1200.0, 2025-05-07T20:32:11.9308289Z contiguous=False, 2025-05-07T20:32:11.9308522Z compiled=True, 2025-05-07T20:32:11.9308723Z ) 2025-05-07T20:32:11.9309047Z self = 2025-05-07T20:32:11.9309596Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:11.9309873Z 2025-05-07T20:32:11.9309956Z @given( 2025-05-07T20:32:11.9310185Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.9310501Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.9310821Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.9311153Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.9311491Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.9311793Z ) 2025-05-07T20:32:11.9312196Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.9312648Z def test_silu_mul_quant( 2025-05-07T20:32:11.9312898Z self, 2025-05-07T20:32:11.9313104Z T: int, 2025-05-07T20:32:11.9313301Z D: int, 2025-05-07T20:32:11.9313523Z scale_ub: Optional[float], 2025-05-07T20:32:11.9313806Z contiguous: bool, 2025-05-07T20:32:11.9314046Z compiled: bool, 2025-05-07T20:32:11.9314276Z ) -> None: 2025-05-07T20:32:11.9314498Z torch.manual_seed(2025) 2025-05-07T20:32:11.9314740Z 2025-05-07T20:32:11.9315022Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.9315372Z 2025-05-07T20:32:11.9315565Z x_sign = torch.sign(x) 2025-05-07T20:32:11.9315865Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.9316184Z x = x_sign * x_clamp 2025-05-07T20:32:11.9316421Z x0 = x[:, :D] 2025-05-07T20:32:11.9316640Z x1 = x[:, D:] 2025-05-07T20:32:11.9316853Z 2025-05-07T20:32:11.9317041Z if contiguous: 2025-05-07T20:32:11.9317279Z x0 = x0.contiguous() 2025-05-07T20:32:11.9317679Z x1 = x1.contiguous() 2025-05-07T20:32:11.9317924Z 2025-05-07T20:32:11.9318118Z if scale_ub is not None: 2025-05-07T20:32:11.9318398Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.9318749Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.9319055Z ) 2025-05-07T20:32:11.9319253Z else: 2025-05-07T20:32:11.9319466Z scale_ub_tensor = None 2025-05-07T20:32:11.9319715Z 2025-05-07T20:32:11.9319958Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.9320278Z op = silu_mul_quant 2025-05-07T20:32:11.9320533Z if compiled: 2025-05-07T20:32:11.9320782Z op = torch.compile(op) 2025-05-07T20:32:11.9321083Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.9321364Z 2025-05-07T20:32:11.9321559Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.9321729Z 2025-05-07T20:32:11.9321833Z moe/activation_test.py:117: 2025-05-07T20:32:11.9322139Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.9322465Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.9322754Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.9323322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.9323888Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.9324551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.9325316Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.9325864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.9326549Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.9327226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.9327768Z kernel = self.compile( 2025-05-07T20:32:11.9328318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.9329028Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.9329433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.9329660Z 2025-05-07T20:32:11.9329880Z self = 2025-05-07T20:32:11.9330981Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.9332371Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb6bf370>} 2025-05-07T20:32:11.9333735Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.9334783Z context = 2025-05-07T20:32:11.9335076Z 2025-05-07T20:32:11.9335250Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.9335776Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.9336258Z module_map=module_map) 2025-05-07T20:32:11.9336629Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.9336986Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.9337249Z E ^ 2025-05-07T20:32:11.9337719Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.9338320Z 2025-05-07T20:32:11.9338748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.9339271Z 2025-05-07T20:32:12.1244444Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.1244963Z self=, 2025-05-07T20:32:12.1245594Z T=4096, 2025-05-07T20:32:12.1245862Z D=5120, 2025-05-07T20:32:12.1246135Z scale_ub=1200.0, 2025-05-07T20:32:12.1246458Z contiguous=True, 2025-05-07T20:32:12.1246719Z compiled=True, 2025-05-07T20:32:12.1246930Z ) 2025-05-07T20:32:12.1247259Z self = 2025-05-07T20:32:12.1247766Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:12.1248040Z 2025-05-07T20:32:12.1248125Z @given( 2025-05-07T20:32:12.1248355Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.1248678Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.1248991Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.1249332Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.1249661Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.1249950Z ) 2025-05-07T20:32:12.1250305Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.1250752Z def test_silu_mul_quant( 2025-05-07T20:32:12.1251109Z self, 2025-05-07T20:32:12.1251310Z T: int, 2025-05-07T20:32:12.1251507Z D: int, 2025-05-07T20:32:12.1251729Z scale_ub: Optional[float], 2025-05-07T20:32:12.1252010Z contiguous: bool, 2025-05-07T20:32:12.1252250Z compiled: bool, 2025-05-07T20:32:12.1252477Z ) -> None: 2025-05-07T20:32:12.1252730Z torch.manual_seed(2025) 2025-05-07T20:32:12.1252995Z 2025-05-07T20:32:12.1253277Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.1253622Z 2025-05-07T20:32:12.1253821Z x_sign = torch.sign(x) 2025-05-07T20:32:12.1254115Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.1254499Z x = x_sign * x_clamp 2025-05-07T20:32:12.1254743Z x0 = x[:, :D] 2025-05-07T20:32:12.1254958Z x1 = x[:, D:] 2025-05-07T20:32:12.1255169Z 2025-05-07T20:32:12.1255362Z if contiguous: 2025-05-07T20:32:12.1255778Z x0 = x0.contiguous() 2025-05-07T20:32:12.1256049Z x1 = x1.contiguous() 2025-05-07T20:32:12.1256293Z 2025-05-07T20:32:12.1256488Z if scale_ub is not None: 2025-05-07T20:32:12.1256767Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.1257112Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.1257422Z ) 2025-05-07T20:32:12.1257621Z else: 2025-05-07T20:32:12.1257839Z scale_ub_tensor = None 2025-05-07T20:32:12.1258174Z 2025-05-07T20:32:12.1258418Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.1258739Z op = silu_mul_quant 2025-05-07T20:32:12.1258992Z if compiled: 2025-05-07T20:32:12.1259248Z op = torch.compile(op) 2025-05-07T20:32:12.1259551Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.1259825Z 2025-05-07T20:32:12.1260018Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.1260192Z 2025-05-07T20:32:12.1260295Z moe/activation_test.py:117: 2025-05-07T20:32:12.1260600Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.1260925Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.1261219Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.1261791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.1262353Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.1263143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.1263849Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.1264396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.1265081Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.1265750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.1266298Z kernel = self.compile( 2025-05-07T20:32:12.1266851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.1267519Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.1267925Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.1268150Z 2025-05-07T20:32:12.1268374Z self = 2025-05-07T20:32:12.1269470Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.1270872Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb2a0310>} 2025-05-07T20:32:12.1272327Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.1273400Z context = 2025-05-07T20:32:12.1273691Z 2025-05-07T20:32:12.1273870Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.1274402Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.1274880Z module_map=module_map) 2025-05-07T20:32:12.1275315Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.1275675Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.1275936Z E ^ 2025-05-07T20:32:12.1276408Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.1276866Z 2025-05-07T20:32:12.1277295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.1277814Z 2025-05-07T20:32:12.1277935Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.1278353Z self=, 2025-05-07T20:32:12.1278762Z T=128, 2025-05-07T20:32:12.1278958Z D=5120, 2025-05-07T20:32:12.1279155Z scale_ub=1200.0, 2025-05-07T20:32:12.1279382Z contiguous=False, 2025-05-07T20:32:12.1279612Z compiled=True, 2025-05-07T20:32:12.1279814Z ) 2025-05-07T20:32:12.4257580Z self = 2025-05-07T20:32:12.4258889Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.4259745Z 2025-05-07T20:32:12.4259973Z @given( 2025-05-07T20:32:12.4260589Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4261282Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4261890Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4262227Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4262567Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4262853Z ) 2025-05-07T20:32:12.4263212Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4263823Z def test_silu_mul_quant( 2025-05-07T20:32:12.4264070Z self, 2025-05-07T20:32:12.4264269Z T: int, 2025-05-07T20:32:12.4264475Z D: int, 2025-05-07T20:32:12.4264698Z scale_ub: Optional[float], 2025-05-07T20:32:12.4264976Z contiguous: bool, 2025-05-07T20:32:12.4265218Z compiled: bool, 2025-05-07T20:32:12.4265445Z ) -> None: 2025-05-07T20:32:12.4265664Z torch.manual_seed(2025) 2025-05-07T20:32:12.4265909Z 2025-05-07T20:32:12.4266188Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4266538Z 2025-05-07T20:32:12.4266739Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4267034Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4267348Z x = x_sign * x_clamp 2025-05-07T20:32:12.4267588Z x0 = x[:, :D] 2025-05-07T20:32:12.4267811Z x1 = x[:, D:] 2025-05-07T20:32:12.4268019Z 2025-05-07T20:32:12.4268214Z if contiguous: 2025-05-07T20:32:12.4268455Z x0 = x0.contiguous() 2025-05-07T20:32:12.4268717Z x1 = x1.contiguous() 2025-05-07T20:32:12.4268964Z 2025-05-07T20:32:12.4269168Z if scale_ub is not None: 2025-05-07T20:32:12.4269446Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4269789Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4270102Z ) 2025-05-07T20:32:12.4270294Z else: 2025-05-07T20:32:12.4270508Z scale_ub_tensor = None 2025-05-07T20:32:12.4270833Z 2025-05-07T20:32:12.4271067Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4271387Z op = silu_mul_quant 2025-05-07T20:32:12.4271641Z if compiled: 2025-05-07T20:32:12.4271887Z op = torch.compile(op) 2025-05-07T20:32:12.4272194Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4272481Z 2025-05-07T20:32:12.4272702Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4283942Z 2025-05-07T20:32:12.4284080Z moe/activation_test.py:117: 2025-05-07T20:32:12.4284400Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4284895Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4285188Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4285762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4286330Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.4286995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4287696Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4288241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4288923Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4289599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4290130Z kernel = self.compile( 2025-05-07T20:32:12.4290685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4291345Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4291756Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4292024Z 2025-05-07T20:32:12.4292248Z self = 2025-05-07T20:32:12.4293349Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4294839Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb2a1090>} 2025-05-07T20:32:12.4296212Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4297259Z context = 2025-05-07T20:32:12.4297551Z 2025-05-07T20:32:12.4297728Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4298317Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4298797Z module_map=module_map) 2025-05-07T20:32:12.4299169Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4299531Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4299791Z E ^ 2025-05-07T20:32:12.4300268Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4300724Z 2025-05-07T20:32:12.4301153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4301676Z 2025-05-07T20:32:12.4301790Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4302253Z self=, 2025-05-07T20:32:12.4302656Z T=16384, 2025-05-07T20:32:12.4302908Z D=7168, 2025-05-07T20:32:12.4303099Z scale_ub=1200.0, 2025-05-07T20:32:12.4303323Z contiguous=True, 2025-05-07T20:32:12.4303548Z compiled=True, 2025-05-07T20:32:12.4303749Z ) 2025-05-07T20:32:12.4304075Z self = 2025-05-07T20:32:12.4304578Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:12.4304855Z 2025-05-07T20:32:12.4304933Z @given( 2025-05-07T20:32:12.4305172Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4305489Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4305799Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4306173Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4306513Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4306804Z ) 2025-05-07T20:32:12.4307153Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4307603Z def test_silu_mul_quant( 2025-05-07T20:32:12.4307848Z self, 2025-05-07T20:32:12.4308038Z T: int, 2025-05-07T20:32:12.4308233Z D: int, 2025-05-07T20:32:12.4308451Z scale_ub: Optional[float], 2025-05-07T20:32:12.4308717Z contiguous: bool, 2025-05-07T20:32:12.4308961Z compiled: bool, 2025-05-07T20:32:12.4309189Z ) -> None: 2025-05-07T20:32:12.4309399Z torch.manual_seed(2025) 2025-05-07T20:32:12.4309640Z 2025-05-07T20:32:12.4309920Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4310261Z 2025-05-07T20:32:12.4310452Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4310757Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4311060Z x = x_sign * x_clamp 2025-05-07T20:32:12.4311302Z x0 = x[:, :D] 2025-05-07T20:32:12.4311517Z x1 = x[:, D:] 2025-05-07T20:32:12.4311717Z 2025-05-07T20:32:12.4311909Z if contiguous: 2025-05-07T20:32:12.4312150Z x0 = x0.contiguous() 2025-05-07T20:32:12.4312447Z x1 = x1.contiguous() 2025-05-07T20:32:12.4312686Z 2025-05-07T20:32:12.4312884Z if scale_ub is not None: 2025-05-07T20:32:12.4313154Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4313495Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4313801Z ) 2025-05-07T20:32:12.4313992Z else: 2025-05-07T20:32:12.4314286Z scale_ub_tensor = None 2025-05-07T20:32:12.4314540Z 2025-05-07T20:32:12.4314767Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4315082Z op = silu_mul_quant 2025-05-07T20:32:12.4315333Z if compiled: 2025-05-07T20:32:12.4315582Z op = torch.compile(op) 2025-05-07T20:32:12.4315880Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4316158Z 2025-05-07T20:32:12.4316358Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4316523Z 2025-05-07T20:32:12.4316621Z moe/activation_test.py:117: 2025-05-07T20:32:12.4316920Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4317248Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4317529Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4318093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4318669Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.4319334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4320026Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4320567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4321259Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4321974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4322561Z kernel = self.compile( 2025-05-07T20:32:12.4323107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4323768Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4324165Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4324396Z 2025-05-07T20:32:12.4324605Z self = 2025-05-07T20:32:12.4325749Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4327142Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb2a2290>} 2025-05-07T20:32:12.4328502Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4329538Z context = 2025-05-07T20:32:12.4329833Z 2025-05-07T20:32:12.4330003Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4330531Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4331004Z module_map=module_map) 2025-05-07T20:32:12.4331415Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4331813Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4332097Z E ^ 2025-05-07T20:32:12.4332638Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4333198Z 2025-05-07T20:32:12.4333702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4334330Z 2025-05-07T20:32:12.5686022Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.5686477Z self=, 2025-05-07T20:32:12.5687063Z T=16384, 2025-05-07T20:32:12.5687278Z D=5120, 2025-05-07T20:32:12.5687483Z scale_ub=1200.0, 2025-05-07T20:32:12.5687710Z contiguous=True, 2025-05-07T20:32:12.5687945Z compiled=False, 2025-05-07T20:32:12.5688158Z ) 2025-05-07T20:32:12.5688493Z self = 2025-05-07T20:32:12.5689119Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:12.5689466Z 2025-05-07T20:32:12.5689547Z @given( 2025-05-07T20:32:12.5689789Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.5690103Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.5690418Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.5690758Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.5691086Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.5691384Z ) 2025-05-07T20:32:12.5691747Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.5692192Z def test_silu_mul_quant( 2025-05-07T20:32:12.5692434Z self, 2025-05-07T20:32:12.5692638Z T: int, 2025-05-07T20:32:12.5692838Z D: int, 2025-05-07T20:32:12.5693061Z scale_ub: Optional[float], 2025-05-07T20:32:12.5693343Z contiguous: bool, 2025-05-07T20:32:12.5693587Z compiled: bool, 2025-05-07T20:32:12.5693852Z ) -> None: 2025-05-07T20:32:12.5694071Z torch.manual_seed(2025) 2025-05-07T20:32:12.5694436Z 2025-05-07T20:32:12.5694716Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.5695062Z 2025-05-07T20:32:12.5695253Z x_sign = torch.sign(x) 2025-05-07T20:32:12.5695553Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.5695865Z x = x_sign * x_clamp 2025-05-07T20:32:12.5696103Z x0 = x[:, :D] 2025-05-07T20:32:12.5696324Z x1 = x[:, D:] 2025-05-07T20:32:12.5696533Z 2025-05-07T20:32:12.5696724Z if contiguous: 2025-05-07T20:32:12.5696963Z x0 = x0.contiguous() 2025-05-07T20:32:12.5697229Z x1 = x1.contiguous() 2025-05-07T20:32:12.5697540Z 2025-05-07T20:32:12.5697739Z if scale_ub is not None: 2025-05-07T20:32:12.5698110Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.5698446Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.5698757Z ) 2025-05-07T20:32:12.5698952Z else: 2025-05-07T20:32:12.5699165Z scale_ub_tensor = None 2025-05-07T20:32:12.5699421Z 2025-05-07T20:32:12.5699660Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.5699977Z op = silu_mul_quant 2025-05-07T20:32:12.5700231Z if compiled: 2025-05-07T20:32:12.5700484Z op = torch.compile(op) 2025-05-07T20:32:12.5700787Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.5701060Z 2025-05-07T20:32:12.5701263Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.5701430Z 2025-05-07T20:32:12.5701537Z moe/activation_test.py:117: 2025-05-07T20:32:12.5701834Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.5702171Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.5702459Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.5703156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:12.5703860Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.5704405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.5705106Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.5705775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.5706405Z kernel = self.compile( 2025-05-07T20:32:12.5706959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.5707628Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.5708023Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.5708255Z 2025-05-07T20:32:12.5708469Z self = 2025-05-07T20:32:12.5709565Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.5710969Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb2a11b0>} 2025-05-07T20:32:12.5712337Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.5713381Z context = 2025-05-07T20:32:12.5713679Z 2025-05-07T20:32:12.5713848Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.5714385Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.5714908Z module_map=module_map) 2025-05-07T20:32:12.5715278Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.5715645Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.5715912Z E ^ 2025-05-07T20:32:12.5716382Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.5716844Z 2025-05-07T20:32:12.5717271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.5717796Z 2025-05-07T20:32:12.5717948Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.5718369Z self=, 2025-05-07T20:32:12.5718777Z T=1, 2025-05-07T20:32:12.5718968Z D=7168, 2025-05-07T20:32:12.5719169Z scale_ub=1200.0, 2025-05-07T20:32:12.5719397Z contiguous=False, 2025-05-07T20:32:12.5719636Z compiled=False, 2025-05-07T20:32:12.5719844Z ) 2025-05-07T20:32:12.5720165Z self = 2025-05-07T20:32:12.5720671Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:12.5720952Z 2025-05-07T20:32:12.5721031Z @given( 2025-05-07T20:32:12.5721269Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.5721582Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.5721900Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.5722238Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.5722568Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.5722858Z ) 2025-05-07T20:32:12.5723211Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.5723652Z def test_silu_mul_quant( 2025-05-07T20:32:12.5723897Z self, 2025-05-07T20:32:12.5724097Z T: int, 2025-05-07T20:32:12.5724294Z D: int, 2025-05-07T20:32:12.5724517Z scale_ub: Optional[float], 2025-05-07T20:32:12.5724798Z contiguous: bool, 2025-05-07T20:32:12.5725043Z compiled: bool, 2025-05-07T20:32:12.5725264Z ) -> None: 2025-05-07T20:32:12.5725492Z torch.manual_seed(2025) 2025-05-07T20:32:12.5725742Z 2025-05-07T20:32:12.5726016Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.5726364Z 2025-05-07T20:32:12.5726649Z x_sign = torch.sign(x) 2025-05-07T20:32:12.5726947Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.5727264Z x = x_sign * x_clamp 2025-05-07T20:32:12.5727511Z x0 = x[:, :D] 2025-05-07T20:32:12.5727727Z x1 = x[:, D:] 2025-05-07T20:32:12.5727937Z 2025-05-07T20:32:12.5728134Z if contiguous: 2025-05-07T20:32:12.5728364Z x0 = x0.contiguous() 2025-05-07T20:32:12.5728627Z x1 = x1.contiguous() 2025-05-07T20:32:12.5728873Z 2025-05-07T20:32:12.5729063Z if scale_ub is not None: 2025-05-07T20:32:12.5729347Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.5729693Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.5730004Z ) 2025-05-07T20:32:12.5730204Z else: 2025-05-07T20:32:12.5730419Z scale_ub_tensor = None 2025-05-07T20:32:12.5730678Z 2025-05-07T20:32:12.5730918Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.5731244Z op = silu_mul_quant 2025-05-07T20:32:12.5731503Z if compiled: 2025-05-07T20:32:12.5731759Z op = torch.compile(op) 2025-05-07T20:32:12.5732062Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.5732344Z 2025-05-07T20:32:12.5732538Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.5732711Z 2025-05-07T20:32:12.5732812Z moe/activation_test.py:117: 2025-05-07T20:32:12.5733114Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.5733489Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.5733780Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.5734480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.5735181Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.5735727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.5736419Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.5737138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.5737675Z kernel = self.compile( 2025-05-07T20:32:12.5738283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.5738954Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.5739359Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.5739584Z 2025-05-07T20:32:12.5739796Z self = 2025-05-07T20:32:12.5740893Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.5742291Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb2a2680>} 2025-05-07T20:32:12.5743662Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.5744704Z context = 2025-05-07T20:32:12.5744997Z 2025-05-07T20:32:12.5745169Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.5745700Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.5746178Z module_map=module_map) 2025-05-07T20:32:12.5746627Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.5746989Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.5747253Z E ^ 2025-05-07T20:32:12.5747730Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.5748185Z 2025-05-07T20:32:12.5748607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.5749132Z 2025-05-07T20:32:12.7665451Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.7666139Z self=, 2025-05-07T20:32:12.7666737Z T=4096, 2025-05-07T20:32:12.7667000Z D=7168, 2025-05-07T20:32:12.7667272Z scale_ub=1200.0, 2025-05-07T20:32:12.7667601Z contiguous=False, 2025-05-07T20:32:12.7667833Z compiled=True, 2025-05-07T20:32:12.7668040Z ) 2025-05-07T20:32:12.7668375Z self = 2025-05-07T20:32:12.7668880Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.7669160Z 2025-05-07T20:32:12.7669239Z @given( 2025-05-07T20:32:12.7669472Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.7669789Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.7670093Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.7670433Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.7670909Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.7671191Z ) 2025-05-07T20:32:12.7671548Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.7671995Z def test_silu_mul_quant( 2025-05-07T20:32:12.7672231Z self, 2025-05-07T20:32:12.7672430Z T: int, 2025-05-07T20:32:12.7672642Z D: int, 2025-05-07T20:32:12.7672891Z scale_ub: Optional[float], 2025-05-07T20:32:12.7673170Z contiguous: bool, 2025-05-07T20:32:12.7673434Z compiled: bool, 2025-05-07T20:32:12.7673663Z ) -> None: 2025-05-07T20:32:12.7673877Z torch.manual_seed(2025) 2025-05-07T20:32:12.7674247Z 2025-05-07T20:32:12.7674523Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.7674864Z 2025-05-07T20:32:12.7675058Z x_sign = torch.sign(x) 2025-05-07T20:32:12.7675354Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.7675662Z x = x_sign * x_clamp 2025-05-07T20:32:12.7675900Z x0 = x[:, :D] 2025-05-07T20:32:12.7676114Z x1 = x[:, D:] 2025-05-07T20:32:12.7676321Z 2025-05-07T20:32:12.7676502Z if contiguous: 2025-05-07T20:32:12.7676732Z x0 = x0.contiguous() 2025-05-07T20:32:12.7676992Z x1 = x1.contiguous() 2025-05-07T20:32:12.7677227Z 2025-05-07T20:32:12.7677420Z if scale_ub is not None: 2025-05-07T20:32:12.7677699Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.7678034Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.7678344Z ) 2025-05-07T20:32:12.7678542Z else: 2025-05-07T20:32:12.7678747Z scale_ub_tensor = None 2025-05-07T20:32:12.7678998Z 2025-05-07T20:32:12.7679236Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.7679545Z op = silu_mul_quant 2025-05-07T20:32:12.7679793Z if compiled: 2025-05-07T20:32:12.7680039Z op = torch.compile(op) 2025-05-07T20:32:12.7680342Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.7680611Z 2025-05-07T20:32:12.7680804Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.7680972Z 2025-05-07T20:32:12.7681078Z moe/activation_test.py:117: 2025-05-07T20:32:12.7681368Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.7681700Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.7682107Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.7682677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.7683252Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.7683921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:12.7684623Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:12.7685161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:12.7685855Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:12.7686524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:12.7687059Z     kernel = self.compile(
2025-05-07T20:32:12.7687613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:12.7688284Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:12.7688679Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:12.7688906Z
2025-05-07T20:32:12.7689119Z self = <...>
2025-05-07T20:32:12.7690218Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:12.7692139Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7faaeb2a3b50>}
2025-05-07T20:32:12.7693737Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:12.7694787Z context = <...>
2025-05-07T20:32:12.7695136Z
2025-05-07T20:32:12.7695306Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:12.7695834Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:12.7696310Z                            module_map=module_map)
2025-05-07T20:32:12.7696675Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.7697033Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.7697291Z E       ^
2025-05-07T20:32:12.7697763Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.7698305Z
2025-05-07T20:32:12.7698726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.7699257Z
2025-05-07T20:32:12.7699362Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.7699776Z     self=<...>,
2025-05-07T20:32:12.7700179Z     T=128,
2025-05-07T20:32:12.7700360Z     D=7168,
2025-05-07T20:32:12.7700553Z     scale_ub=1200.0,
2025-05-07T20:32:12.7700784Z     contiguous=False,
2025-05-07T20:32:12.7701002Z     compiled=True,
2025-05-07T20:32:12.7701205Z )
2025-05-07T20:32:12.8725332Z self = <...>
2025-05-07T20:32:12.8726150Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True
2025-05-07T20:32:12.8726551Z
2025-05-07T20:32:12.8733551Z     @given(
2025-05-07T20:32:12.8733934Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:12.8734370Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:12.8734681Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:12.8735207Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:12.8735547Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:12.8735829Z     )
2025-05-07T20:32:12.8736191Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:12.8736641Z     def test_silu_mul_quant(
2025-05-07T20:32:12.8736882Z         self,
2025-05-07T20:32:12.8737078Z         T: int,
2025-05-07T20:32:12.8737278Z         D: int,
2025-05-07T20:32:12.8737491Z         scale_ub: Optional[float],
2025-05-07T20:32:12.8737776Z         contiguous: bool,
2025-05-07T20:32:12.8738100Z         compiled: bool,
2025-05-07T20:32:12.8738332Z     ) -> None:
2025-05-07T20:32:12.8738554Z         torch.manual_seed(2025)
2025-05-07T20:32:12.8738801Z
2025-05-07T20:32:12.8739081Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:12.8739424Z
2025-05-07T20:32:12.8739620Z         x_sign = torch.sign(x)
2025-05-07T20:32:12.8739924Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:12.8740237Z         x = x_sign * x_clamp
2025-05-07T20:32:12.8740479Z         x0 = x[:, :D]
2025-05-07T20:32:12.8740700Z         x1 = x[:, D:]
2025-05-07T20:32:12.8740907Z
2025-05-07T20:32:12.8741098Z         if contiguous:
2025-05-07T20:32:12.8741335Z             x0 = x0.contiguous()
2025-05-07T20:32:12.8741597Z             x1 = x1.contiguous()
2025-05-07T20:32:12.8741841Z
2025-05-07T20:32:12.8742041Z         if scale_ub is not None:
2025-05-07T20:32:12.8742316Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:12.8742772Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:12.8743103Z             )
2025-05-07T20:32:12.8743294Z         else:
2025-05-07T20:32:12.8743508Z             scale_ub_tensor = None
2025-05-07T20:32:12.8743766Z
2025-05-07T20:32:12.8744008Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:12.8744324Z             op = silu_mul_quant
2025-05-07T20:32:12.8744585Z             if compiled:
2025-05-07T20:32:12.8744841Z                 op = torch.compile(op)
2025-05-07T20:32:12.8745142Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:12.8745491Z
2025-05-07T20:32:12.8745688Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:12.8745858Z
2025-05-07T20:32:12.8745964Z moe/activation_test.py:117:
2025-05-07T20:32:12.8746269Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:12.8746603Z moe/activation_test.py:115: in fn
2025-05-07T20:32:12.8746889Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:12.8747465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:12.8748046Z     return fn(*args, **kwargs)
2025-05-07T20:32:12.8748722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:12.8749423Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:12.8749980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:12.8750682Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:12.8751365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:12.8751907Z     kernel = self.compile(
2025-05-07T20:32:12.8752465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:12.8753189Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:12.8753591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:12.8753819Z
2025-05-07T20:32:12.8754037Z self = <...>
2025-05-07T20:32:12.8755228Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:12.8756974Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7faaeb318670>}
2025-05-07T20:32:12.8758359Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:12.8759419Z context = <...>
2025-05-07T20:32:12.8759714Z
2025-05-07T20:32:12.8759896Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:12.8760424Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:12.8760911Z                            module_map=module_map)
2025-05-07T20:32:12.8761288Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.8761651Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.8761909Z E       ^
2025-05-07T20:32:12.8762386Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.8762847Z
2025-05-07T20:32:12.8763276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.8763871Z
2025-05-07T20:32:12.8763982Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> triton.compiler.errors.CompilationError at compiler.py:100 (fp8e4nv not supported; source listing and traceback identical to the example above)
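Every compiled example above fails in the same place, before the kernel ever launches: the Triton kernel requests the fp8e4nv (e4m3) element type, which Triton only lowers on newer NVIDIA architectures (compute capability 8.9 or higher, i.e. Ada/Hopper class, in recent releases). The GPU driving this job predates that, so only fp8e4b15 and fp8e5 are available, exactly as the ValueError reports. Below is a minimal sketch of a capability gate that would skip such cases up front; supports_fp8e4nv, the threshold, and the test class are illustrative assumptions, not FBGEMM's actual guard:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Assumption: Triton lowers fp8e4nv (e4m3) only on compute capability
    # >= (8, 9); the GPU in this job reports a capability below that.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv is unsupported on this GPU")
class Fp8GatedTest(unittest.TestCase):
    def test_capability_gate(self) -> None:
        # Only runs on hardware where the Triton fp8e4nv path can compile.
        self.assertGreaterEqual(torch.cuda.get_device_capability(), (8, 9))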
2025-05-07T20:32:12.8781989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.8782693Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.8783356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.8784043Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.8784718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.8785256Z kernel = self.compile( 2025-05-07T20:32:12.8785805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.8786473Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.8786918Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.8787144Z 2025-05-07T20:32:12.8787358Z self = 2025-05-07T20:32:12.8788453Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.8789854Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb3191b0>} 2025-05-07T20:32:12.8791227Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.8792273Z context = 2025-05-07T20:32:12.8792569Z 2025-05-07T20:32:12.8792740Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.8793262Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.8793735Z module_map=module_map) 2025-05-07T20:32:12.8794104Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.8794459Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.8794721Z E ^ 2025-05-07T20:32:12.8795191Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.8795648Z 2025-05-07T20:32:12.8796159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.8796680Z 2025-05-07T20:32:12.9599131Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.9599799Z self=, 2025-05-07T20:32:12.9600370Z T=16384, 2025-05-07T20:32:12.9600652Z D=5120, 2025-05-07T20:32:12.9600845Z scale_ub=None, 2025-05-07T20:32:12.9601068Z contiguous=False, 2025-05-07T20:32:12.9601300Z compiled=False, 2025-05-07T20:32:12.9601514Z ) 2025-05-07T20:32:12.9601836Z self = 2025-05-07T20:32:12.9602347Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:12.9602631Z 2025-05-07T20:32:12.9602713Z @given( 2025-05-07T20:32:12.9602945Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.9603270Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.9603588Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.9603919Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.9604257Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.9604553Z ) 2025-05-07T20:32:12.9604915Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.9605363Z def test_silu_mul_quant( 2025-05-07T20:32:12.9605607Z self, 2025-05-07T20:32:12.9605809Z T: int, 2025-05-07T20:32:12.9606004Z D: int, 2025-05-07T20:32:12.9606338Z scale_ub: Optional[float], 2025-05-07T20:32:12.9606616Z contiguous: bool, 2025-05-07T20:32:12.9606855Z compiled: bool, 2025-05-07T20:32:12.9607083Z ) -> None: 2025-05-07T20:32:12.9607306Z torch.manual_seed(2025) 2025-05-07T20:32:12.9607551Z 2025-05-07T20:32:12.9607828Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.9608172Z 2025-05-07T20:32:12.9608364Z x_sign = torch.sign(x) 2025-05-07T20:32:12.9608673Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.9610770Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.9612797Z 2025-05-07T20:32:12.9612917Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:12.9613140Z 2025-05-07T20:32:12.9613247Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.9613672Z self=, 2025-05-07T20:32:12.9614082Z T=4096, 2025-05-07T20:32:12.9614271Z D=7168, 2025-05-07T20:32:12.9614465Z scale_ub=1200.0, 2025-05-07T20:32:12.9614687Z contiguous=True, 2025-05-07T20:32:12.9614914Z compiled=True, 2025-05-07T20:32:12.9615120Z ) 2025-05-07T20:32:12.9615445Z self = 2025-05-07T20:32:12.9615946Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:12.9616251Z 2025-05-07T20:32:12.9616370Z @given( 2025-05-07T20:32:12.9616662Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.9616981Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.9617297Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.9617635Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.9617967Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.9618343Z ) 2025-05-07T20:32:12.9618835Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.9619284Z def test_silu_mul_quant( 2025-05-07T20:32:12.9619526Z self, 2025-05-07T20:32:12.9619722Z T: int, 2025-05-07T20:32:12.9619921Z D: int, 2025-05-07T20:32:12.9620143Z scale_ub: Optional[float], 2025-05-07T20:32:12.9620418Z contiguous: bool, 2025-05-07T20:32:12.9620659Z compiled: bool, 2025-05-07T20:32:12.9620887Z ) -> None: 2025-05-07T20:32:12.9621109Z torch.manual_seed(2025) 2025-05-07T20:32:12.9621355Z 2025-05-07T20:32:12.9621634Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.9621980Z 2025-05-07T20:32:12.9622176Z x_sign = torch.sign(x) 2025-05-07T20:32:12.9622478Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.9624550Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
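This is the second, independent failure mode, and the numbers in the message match the test's own allocations: the [T, 2 * D] bfloat16 input for T=16384, D=5120 is 16384 x 10240 x 2 bytes = 320 MiB, exactly the failed request, while the reported free memory shrinks from 140.44 MiB here to under 30 MiB in later examples, suggesting tensors cached from earlier Hypothesis examples are never returned. Two mitigations, sketched as assumptions rather than as changes the suite actually makes; free_cuda_memory is an illustrative helper, not part of activation_test.py:

import gc
import os

# The allocator hint quoted in the error text must be in the environment
# before CUDA initializes (e.g. exported by the workflow), not set mid-run:
#   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported after the env var so the setting can take effect


def free_cuda_memory() -> None:
    # Drop dead Python references, then return cached blocks to the driver,
    # so each Hypothesis example starts from a clean allocator pool.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
        torch.cuda.empty_cache()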
The remaining examples repeat the same two failure modes; their source listings and tracebacks, identical to the ones shown above, are condensed here:
2025-05-07T20:32:12.9613247Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 112.00 MiB)
2025-05-07T20:32:12.9626979Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 448.00 MiB)
2025-05-07T20:32:12.9639703Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 56.00 MiB)
2025-05-07T20:32:12.9653260Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (tried to allocate 56.00 MiB)
2025-05-07T20:32:13.0939879Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError at compiler.py:100 (fp8e4nv not supported)
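To chase a single failing parameter set without replaying the whole Hypothesis search, the arguments recorded in this log can be pinned with Hypothesis's @example decorator, which always runs before any randomly drawn examples. A self-contained sketch; the test name and trivial body stand in for test_silu_mul_quant's real assertions:

from typing import Optional

from hypothesis import example, given, settings, strategies as st


@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
)
@example(T=16384, D=5120, scale_ub=None)  # the first OOM case in this log
@settings(max_examples=5, deadline=None)
def test_pinned_example(T: int, D: int, scale_ub: Optional[float]) -> None:
    # Stand-in body; the real test would build its [T, 2 * D] input here.
    assert T >= 1 and D in (5120, 7168)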
2025-05-07T20:32:13.0976703Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError at compiler.py:100 (fp8e4nv not supported)
2025-05-07T20:32:13.1779059Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError at compiler.py:100 (fp8e4nv not supported)
2025-05-07T20:32:13.1810783Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 56.00 MiB)
2025-05-07T20:32:13.2781998Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError at compiler.py:100 (fp8e4nv not supported)
2025-05-07T20:32:13.2816604Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (tried to allocate 40.00 MiB)
2025-05-07T20:32:13.2829733Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 320.00 MiB)
2025-05-07T20:32:13.3816751Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 80.00 MiB)
2025-05-07T20:32:13.3829980Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 40.00 MiB)
2025-05-07T20:32:13.3842441Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 112.00 MiB)
2025-05-07T20:32:13.3855043Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 40.00 MiB)
2025-05-07T20:32:13.3867901Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 112.00 MiB)
2025-05-07T20:32:13.5155457Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:13.5156066Z     self=<...>,
2025-05-07T20:32:13.5156480Z     T=16384,
2025-05-07T20:32:13.5156672Z     D=7168,
2025-05-07T20:32:13.5156868Z     scale_ub=None,
2025-05-07T20:32:13.5157092Z     contiguous=False,
2025-05-07T20:32:13.5157322Z     compiled=True,
2025-05-07T20:32:13.5157529Z )
2025-05-07T20:32:13.5157853Z self = <...>
2025-05-07T20:32:13.5158360Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:13.5158648Z
2025-05-07T20:32:13.5163334Z         torch.manual_seed(2025)
2025-05-07T20:32:13.5163579Z
2025-05-07T20:32:13.5163853Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:13.5165976Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:13.5167968Z 2025-05-07T20:32:13.5168098Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:13.5168313Z 2025-05-07T20:32:13.5168419Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.5168844Z self=, 2025-05-07T20:32:13.5169256Z T=4096, 2025-05-07T20:32:13.5169440Z D=7168, 2025-05-07T20:32:13.5169637Z scale_ub=None, 2025-05-07T20:32:13.5169862Z contiguous=True, 2025-05-07T20:32:13.5170086Z compiled=False, 2025-05-07T20:32:13.5170294Z ) 2025-05-07T20:32:13.5170622Z self = 2025-05-07T20:32:13.5171187Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:13.5171466Z 2025-05-07T20:32:13.5171543Z @given( 2025-05-07T20:32:13.5171776Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.5172093Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.5172406Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.5172741Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.5173106Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.5173414Z ) 2025-05-07T20:32:13.5173767Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.5174216Z def test_silu_mul_quant( 2025-05-07T20:32:13.5174455Z self, 2025-05-07T20:32:13.5174654Z T: int, 2025-05-07T20:32:13.5174854Z D: int, 2025-05-07T20:32:13.5175070Z scale_ub: Optional[float], 2025-05-07T20:32:13.5175347Z contiguous: bool, 2025-05-07T20:32:13.5175589Z compiled: bool, 2025-05-07T20:32:13.5175813Z ) -> None: 2025-05-07T20:32:13.5176034Z torch.manual_seed(2025) 2025-05-07T20:32:13.5176277Z 2025-05-07T20:32:13.5176555Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.5178798Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:13.5180727Z 2025-05-07T20:32:13.5180847Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:13.5181067Z 2025-05-07T20:32:13.5181170Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.5181587Z self=, 2025-05-07T20:32:13.5181998Z T=16384, 2025-05-07T20:32:13.5182191Z D=7168, 2025-05-07T20:32:13.5182386Z scale_ub=None, 2025-05-07T20:32:13.5182609Z contiguous=True, 2025-05-07T20:32:13.5182831Z compiled=False, 2025-05-07T20:32:13.5183035Z ) 2025-05-07T20:32:13.5183356Z self = 2025-05-07T20:32:13.5183858Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:13.5184140Z 2025-05-07T20:32:13.5184217Z @given( 2025-05-07T20:32:13.5184447Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.5184762Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.5185071Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.5185408Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.5185744Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.5186027Z ) 2025-05-07T20:32:13.5186381Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.5186835Z def test_silu_mul_quant( 2025-05-07T20:32:13.5187125Z self, 2025-05-07T20:32:13.5187327Z T: int, 2025-05-07T20:32:13.5187525Z D: int, 2025-05-07T20:32:13.5187741Z scale_ub: Optional[float], 2025-05-07T20:32:13.5188019Z contiguous: bool, 2025-05-07T20:32:13.5188263Z compiled: bool, 2025-05-07T20:32:13.5188483Z ) -> None: 2025-05-07T20:32:13.5188704Z torch.manual_seed(2025) 2025-05-07T20:32:13.5188949Z 2025-05-07T20:32:13.5189229Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.5191338Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:13.5193306Z 2025-05-07T20:32:13.5193426Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:13.5193646Z 2025-05-07T20:32:13.5193752Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.5194173Z self=, 2025-05-07T20:32:13.5194579Z T=16384, 2025-05-07T20:32:13.5194776Z D=7168, 2025-05-07T20:32:13.5194968Z scale_ub=1200.0, 2025-05-07T20:32:13.5195188Z contiguous=True, 2025-05-07T20:32:13.5195415Z compiled=False, 2025-05-07T20:32:13.5195620Z ) 2025-05-07T20:32:13.5195941Z self = 2025-05-07T20:32:13.5196443Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:13.5196725Z 2025-05-07T20:32:13.5196802Z @given( 2025-05-07T20:32:13.5197036Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.5197351Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.5197661Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.5197996Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.5198324Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.5198614Z ) 2025-05-07T20:32:13.5198971Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.5199494Z def test_silu_mul_quant( 2025-05-07T20:32:13.5199742Z self, 2025-05-07T20:32:13.5199941Z T: int, 2025-05-07T20:32:13.5200135Z D: int, 2025-05-07T20:32:13.5200359Z scale_ub: Optional[float], 2025-05-07T20:32:13.5200632Z contiguous: bool, 2025-05-07T20:32:13.5200877Z compiled: bool, 2025-05-07T20:32:13.5201098Z ) -> None: 2025-05-07T20:32:13.5201319Z torch.manual_seed(2025) 2025-05-07T20:32:13.5201566Z 2025-05-07T20:32:13.5201837Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.5204005Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
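The allocator report points at two distinct problems: the 22.07 GiB device fills up as Hypothesis runs successive large examples, and the reserved-but-unallocated figure hints at fragmentation. Below is a minimal mitigation sketch, not part of this test suite: the helper name release_cuda_memory is ours, and the environment variable only takes effect if it is set before the process first initializes CUDA.

import gc
import os

# Must be set before the first CUDA allocation in the process, so keep it at
# the very top of the test module (or export it in the job environment).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # noqa: E402  -- imported after the allocator knob is set


def release_cuda_memory() -> None:
    # Hypothesis replays all examples inside a single test invocation, so
    # unittest's tearDown never runs between examples; call this at the end
    # of the test body instead. gc.collect() frees unreachable tensors and
    # empty_cache() hands the allocator's cached blocks back to the driver.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

Note that cached blocks still referenced by live tensors cannot be returned, so dropping references (del x, x0, x1) before calling the helper matters more than the empty_cache() call itself.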
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = 
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

The next examples failed in the same two ways:

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 56.00 MiB)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> the same CompilationError, reached through torch/_dynamo/eval_frame.py:678 since compiled=True routes silu_mul_quant through torch.compile
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:95, x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) (tried to allocate 20.00 MiB; free memory now down to 4.44 MiB)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 20.00 MiB)

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
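Unlike the intermittent OOMs, the CompilationError is deterministic on this hardware: Triton refuses to lower its fp8e4nv type (PyTorch's float8_e4m3fn) on this GPU, which per the error only supports the fp8e4b15 and fp8e5 encodings. A common guard is to skip fp8 paths below a minimum compute capability; the sketch below is ours, not FBGEMM API, and the (8, 9) cutoff is an assumption that e4m3 codegen begins with Ada-class parts, which matches the supported-dtype list in the error but is not stated by this log.

import unittest

import torch


def gpu_supports_fp8e4nv() -> bool:
    # Assumed threshold: fp8e4nv needs compute capability >= (8, 9); the
    # runner GPU here reports only fp8e4b15/fp8e5, i.e. an older part.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical placement: decorate the fp8-dependent test class or method.
@unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
class Fp8GatedTests(unittest.TestCase):
    pass

With such a guard the job would report the fp8 tests as skipped on this runner instead of failing the whole Hypothesis run.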
FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 591, in run
  |     self._callTestMethod(testMethod)
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
  |     method()
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
  | Traceback (most recent call last):
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
  |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
  | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
  | Falsifying example: test_silu_mul_quant(
  |     self=,
  |     T=2048,
  |     D=5120,  # or any other generated value
  |     scale_ub=None,  # or any other generated value
  |     contiguous=True,  # or any other generated value
  |     compiled=False,  # or any other generated value
  | )
  |
  | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case
  +---------------- 2 ----------------
  | Traceback (most recent call last):
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
  |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
  | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
  | Falsifying example: test_silu_mul_quant(
  |     self=,
  |     T=128,
  |     D=7168,
  |     scale_ub=None,
  |     contiguous=True,
  |     compiled=True,
  | )
  |
  | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
  +---------------- 3 ----------------
  | Traceback (most recent call last):
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
  |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
  | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.1442964Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:14.1443415Z | self=, 2025-05-07T20:32:14.1443834Z | T=128, 2025-05-07T20:32:14.1444038Z | D=5120, 2025-05-07T20:32:14.1444264Z | scale_ub=1200.0, 2025-05-07T20:32:14.1444513Z | contiguous=True, 2025-05-07T20:32:14.1444756Z | compiled=True, 2025-05-07T20:32:14.1444993Z | ) 2025-05-07T20:32:14.1445179Z | 2025-05-07T20:32:14.1445707Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:14.1446327Z +---------------- 4 ---------------- 2025-05-07T20:32:14.1446629Z | Traceback (most recent call last): 2025-05-07T20:32:14.1447364Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:14.1448086Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.1448988Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:14.1450091Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1451258Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:14.1452411Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.1453304Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:14.1454380Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1455489Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:14.1456893Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1458155Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:14.1459329Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1460483Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:14.1461467Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.1462475Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:14.1463449Z | fn() 2025-05-07T20:32:14.1464281Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:14.1465205Z | self.fn.run( 2025-05-07T20:32:14.1465981Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:14.1466836Z | kernel = self.compile( 2025-05-07T20:32:14.1467720Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:14.1468841Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1469871Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:14.1471020Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.1471743Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.1472253Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.1472673Z | ^ 2025-05-07T20:32:14.1473355Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.1474181Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:14.1474769Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:14.1475546Z | self=, 2025-05-07T20:32:14.1476181Z | T=1, # or any other generated value 2025-05-07T20:32:14.1476638Z | D=5120, # or any other generated value 2025-05-07T20:32:14.1477122Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:14.1477640Z | contiguous=True, # or any other generated value 2025-05-07T20:32:14.1478178Z | compiled=True, # or any other generated value 2025-05-07T20:32:14.1478617Z | ) 2025-05-07T20:32:14.1478881Z | 2025-05-07T20:32:14.1479657Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:14.1480526Z +------------------------------------ 2025-05-07T20:32:14.1481202Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:14.1500843Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.1501429Z self=, 2025-05-07T20:32:14.1501988Z T=1, 2025-05-07T20:32:14.1502245Z D=5120, 2025-05-07T20:32:14.1502509Z scale_ub=None, 2025-05-07T20:32:14.1502857Z contiguous=True, 2025-05-07T20:32:14.1503166Z compiled=True, 2025-05-07T20:32:14.1503448Z ) 2025-05-07T20:32:14.1503882Z self = 2025-05-07T20:32:14.1504537Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.1504898Z 2025-05-07T20:32:14.1505015Z @given( 2025-05-07T20:32:14.1505344Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.1505808Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.1506247Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.1506710Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.1507170Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.1507599Z ) 2025-05-07T20:32:14.1508104Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.1508726Z def test_silu_mul_quant( 2025-05-07T20:32:14.1509055Z self, 2025-05-07T20:32:14.1509326Z T: int, 2025-05-07T20:32:14.1509609Z D: int, 2025-05-07T20:32:14.1509918Z scale_ub: Optional[float], 2025-05-07T20:32:14.1510479Z contiguous: bool, 2025-05-07T20:32:14.1510829Z compiled: bool, 2025-05-07T20:32:14.1511144Z ) -> None: 2025-05-07T20:32:14.1511444Z torch.manual_seed(2025) 2025-05-07T20:32:14.1511775Z 2025-05-07T20:32:14.1512150Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.1512621Z 2025-05-07T20:32:14.1512880Z x_sign = torch.sign(x) 2025-05-07T20:32:14.1513289Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.1513713Z x = x_sign * x_clamp 2025-05-07T20:32:14.1514041Z x0 = x[:, :D] 2025-05-07T20:32:14.1514344Z x1 = x[:, D:] 2025-05-07T20:32:14.1514705Z 2025-05-07T20:32:14.1514969Z if contiguous: 2025-05-07T20:32:14.1515301Z x0 = x0.contiguous() 
2025-05-07T20:32:14.1515669Z x1 = x1.contiguous() 2025-05-07T20:32:14.1516018Z 2025-05-07T20:32:14.1516306Z if scale_ub is not None: 2025-05-07T20:32:14.1516700Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.1517184Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.1517630Z ) 2025-05-07T20:32:14.1517907Z else: 2025-05-07T20:32:14.1518195Z scale_ub_tensor = None 2025-05-07T20:32:14.1518553Z 2025-05-07T20:32:14.1518884Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1519332Z op = silu_mul_quant 2025-05-07T20:32:14.1519688Z if compiled: 2025-05-07T20:32:14.1520046Z op = torch.compile(op) 2025-05-07T20:32:14.1520467Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1520854Z 2025-05-07T20:32:14.1521133Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.1521539Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.1521948Z 2025-05-07T20:32:14.1522297Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1522819Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.1523247Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.1523705Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.1524188Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1524647Z 2025-05-07T20:32:14.1524945Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.1525220Z 2025-05-07T20:32:14.1525369Z moe/activation_test.py:126: 2025-05-07T20:32:14.1525885Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1526368Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.1526836Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1527945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.1528999Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.1529763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1530705Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1531672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.1532791Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1533901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.1534976Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1536017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.1536940Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.1537807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.1538704Z fn() 2025-05-07T20:32:14.1539426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.1540257Z self.fn.run( 2025-05-07T20:32:14.1540920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1541679Z kernel = self.compile( 2025-05-07T20:32:14.1542488Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1543498Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1544039Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1544373Z 2025-05-07T20:32:14.1544659Z self = 2025-05-07T20:32:14.1546178Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1548142Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7facc8283400>} 2025-05-07T20:32:14.1550049Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.1551500Z context = 2025-05-07T20:32:14.1551904Z 2025-05-07T20:32:14.1552141Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.1552936Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.1553598Z module_map=module_map) 2025-05-07T20:32:14.1554106Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.1554622Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.1555003Z E ^ 2025-05-07T20:32:14.1556139Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.1556809Z 2025-05-07T20:32:14.1557652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.1558393Z 2025-05-07T20:32:14.1558548Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.1559130Z self=, 2025-05-07T20:32:14.1559677Z T=2048, 2025-05-07T20:32:14.1559941Z D=5120, 2025-05-07T20:32:14.1560205Z scale_ub=1200.0, 2025-05-07T20:32:14.1560530Z contiguous=True, 2025-05-07T20:32:14.1560867Z compiled=False, 2025-05-07T20:32:14.1561153Z ) 2025-05-07T20:32:14.1561600Z self = 2025-05-07T20:32:14.1562311Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:14.1562720Z 2025-05-07T20:32:14.1562832Z @given( 2025-05-07T20:32:14.1563172Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.1563625Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.1564086Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.1564557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.1565028Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.1565418Z ) 2025-05-07T20:32:14.1565920Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.1566550Z def test_silu_mul_quant( 2025-05-07T20:32:14.1566899Z self, 2025-05-07T20:32:14.1567188Z T: int, 2025-05-07T20:32:14.1567565Z D: int, 2025-05-07T20:32:14.1567878Z scale_ub: Optional[float], 2025-05-07T20:32:14.1568275Z contiguous: bool, 2025-05-07T20:32:14.1568630Z compiled: bool, 2025-05-07T20:32:14.1568951Z ) -> None: 2025-05-07T20:32:14.1569252Z torch.manual_seed(2025) 2025-05-07T20:32:14.1569591Z 2025-05-07T20:32:14.1569977Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.1570465Z 2025-05-07T20:32:14.1570738Z x_sign = torch.sign(x) 2025-05-07T20:32:14.1571154Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.1571700Z x = x_sign * x_clamp 2025-05-07T20:32:14.1572041Z x0 = x[:, :D] 
2025-05-07T20:32:14.1572326Z x1 = x[:, D:] 2025-05-07T20:32:14.1572622Z 2025-05-07T20:32:14.1572893Z if contiguous: 2025-05-07T20:32:14.1573206Z x0 = x0.contiguous() 2025-05-07T20:32:14.1573562Z x1 = x1.contiguous() 2025-05-07T20:32:14.1573920Z 2025-05-07T20:32:14.1574192Z if scale_ub is not None: 2025-05-07T20:32:14.1574581Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.1575057Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.1575504Z ) 2025-05-07T20:32:14.1575783Z else: 2025-05-07T20:32:14.1576091Z scale_ub_tensor = None 2025-05-07T20:32:14.1576468Z 2025-05-07T20:32:14.1576804Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1577266Z op = silu_mul_quant 2025-05-07T20:32:14.1577641Z if compiled: 2025-05-07T20:32:14.1578098Z op = torch.compile(op) 2025-05-07T20:32:14.1578550Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1578958Z 2025-05-07T20:32:14.1579225Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.1579464Z 2025-05-07T20:32:14.1579607Z moe/activation_test.py:117: 2025-05-07T20:32:14.1580022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1580493Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.1580888Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1581857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.1582814Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.1583671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1584622Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1585558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1586307Z kernel = self.compile( 2025-05-07T20:32:14.1587050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1587968Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1588531Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1588844Z 2025-05-07T20:32:14.1589132Z self = 2025-05-07T20:32:14.1590623Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1592602Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7facc3962e60>} 2025-05-07T20:32:14.1594525Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.1595738Z context = 2025-05-07T20:32:14.1596036Z 2025-05-07T20:32:14.1596210Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.1596827Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.1597455Z module_map=module_map) 2025-05-07T20:32:14.1597983Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.1598458Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.1598819Z E ^ 2025-05-07T20:32:14.1599387Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.1599929Z 2025-05-07T20:32:14.1600352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.1600879Z 2025-05-07T20:32:14.1600983Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.1601407Z self=, 2025-05-07T20:32:14.1601811Z T=2048, 2025-05-07T20:32:14.1601995Z D=5120, 2025-05-07T20:32:14.1602188Z scale_ub=1200.0, 2025-05-07T20:32:14.1602415Z contiguous=True, 2025-05-07T20:32:14.1602632Z compiled=True, 2025-05-07T20:32:14.1602840Z ) 2025-05-07T20:32:14.1603174Z self = 2025-05-07T20:32:14.1603688Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:14.1603960Z 2025-05-07T20:32:14.1604041Z @given( 2025-05-07T20:32:14.1604275Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.1604594Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.1604903Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.1605239Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.1605575Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.1605867Z ) 2025-05-07T20:32:14.1606216Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.1606662Z def test_silu_mul_quant( 2025-05-07T20:32:14.1606911Z self, 2025-05-07T20:32:14.1607103Z T: int, 2025-05-07T20:32:14.1607306Z D: int, 2025-05-07T20:32:14.1607530Z scale_ub: Optional[float], 2025-05-07T20:32:14.1607914Z contiguous: bool, 2025-05-07T20:32:14.1608162Z compiled: bool, 2025-05-07T20:32:14.1608392Z ) -> None: 2025-05-07T20:32:14.1608606Z torch.manual_seed(2025) 2025-05-07T20:32:14.1608856Z 2025-05-07T20:32:14.1609135Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.1609471Z 2025-05-07T20:32:14.1609664Z x_sign = torch.sign(x) 2025-05-07T20:32:14.1609960Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.1610266Z x = x_sign * x_clamp 2025-05-07T20:32:14.1610510Z x0 = x[:, :D] 2025-05-07T20:32:14.1610729Z x1 = x[:, D:] 2025-05-07T20:32:14.1610940Z 2025-05-07T20:32:14.1611122Z if contiguous: 2025-05-07T20:32:14.1611356Z x0 = x0.contiguous() 2025-05-07T20:32:14.1611619Z x1 = x1.contiguous() 2025-05-07T20:32:14.1611855Z 2025-05-07T20:32:14.1612050Z if scale_ub is not None: 2025-05-07T20:32:14.1612334Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.1612722Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.1613034Z ) 2025-05-07T20:32:14.1613230Z else: 2025-05-07T20:32:14.1613441Z scale_ub_tensor = None 2025-05-07T20:32:14.1613696Z 2025-05-07T20:32:14.1613931Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1614242Z op = silu_mul_quant 2025-05-07T20:32:14.1614495Z if compiled: 2025-05-07T20:32:14.1614748Z op = torch.compile(op) 2025-05-07T20:32:14.1615095Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1615374Z 2025-05-07T20:32:14.1615572Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.1615863Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.1616148Z 2025-05-07T20:32:14.1616392Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1616733Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.1617029Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.1617349Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.1617717Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1618189Z 2025-05-07T20:32:14.1618399Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.1618595Z 2025-05-07T20:32:14.1618702Z moe/activation_test.py:126: 2025-05-07T20:32:14.1618993Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1619334Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.1619671Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1620468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.1621231Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.1621792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1622488Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1623191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.1623920Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1624682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.1625441Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1626181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.1626823Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.1627514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.1628042Z fn() 2025-05-07T20:32:14.1628553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.1629144Z self.fn.run( 2025-05-07T20:32:14.1629625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1630166Z kernel = self.compile( 2025-05-07T20:32:14.1630710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1631374Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1631775Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1632003Z 2025-05-07T20:32:14.1632215Z self = 2025-05-07T20:32:14.1633318Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1634723Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7facc243d6c0>}
2025-05-07T20:32:14.1636089Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:14.1637774Z context = 
2025-05-07T20:32:14.1638235Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:14.1638766Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:14.1639246Z module_map=module_map)
2025-05-07T20:32:14.1639613Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.1640023Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:14.1640294Z E ^
2025-05-07T20:32:14.1640769Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.1641649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.1642283Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:14.1642752Z self=,
2025-05-07T20:32:14.1643156Z T=16384,
2025-05-07T20:32:14.1643348Z D=7168,
2025-05-07T20:32:14.1643543Z scale_ub=1200.0,
2025-05-07T20:32:14.1643770Z contiguous=False,
2025-05-07T20:32:14.1643993Z compiled=False,
2025-05-07T20:32:14.1644206Z )
[test source identical to the listing above; elided]
2025-05-07T20:32:14.1657303Z > y_fp8, y_scale = fn()
2025-05-07T20:32:14.1657575Z moe/activation_test.py:117:
[traceback identical to the one above; elided]
2025-05-07T20:32:14.1671550Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.1671908Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:14.1672173Z E ^
2025-05-07T20:32:14.1672699Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.1673580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.1674211Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:14.1674635Z self=,
2025-05-07T20:32:14.1675041Z T=1,
2025-05-07T20:32:14.1675227Z D=7168,
2025-05-07T20:32:14.1675427Z scale_ub=None,
2025-05-07T20:32:14.1675645Z contiguous=True,
2025-05-07T20:32:14.1675867Z compiled=True,
2025-05-07T20:32:14.1676076Z )
[test source identical to the listing above; elided]
2025-05-07T20:32:14.1699750Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:14.1700068Z moe/activation_test.py:126:
[traceback identical to the one above; elided]
2025-05-07T20:32:14.1721133Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.1721502Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:14.1721771Z E ^
2025-05-07T20:32:14.1722245Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.1723141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
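Every failure in this run reduces to the same root cause: Triton refuses to lower the fp8e4nv (FP8 E4M3) dtype on this GPU. The linux.g5 runners carry an NVIDIA A10G (compute capability 8.6), while Triton's fp8e4nv conversion is only available on compute capability 8.9 and newer; older parts get only fp8e4b15 and fp8e5, exactly as the ValueError reports. A capability guard along these lines would let the suite skip cleanly instead of erroring (a minimal sketch, not FBGEMM's actual test scaffolding; the class name is illustrative):

    import unittest
    import torch

    def fp8_e4m3_supported() -> bool:
        # Assumption: Triton's fp8e4nv lowering needs SM 8.9+ (Ada/Hopper);
        # the A10G on this runner reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not fp8_e4m3_supported(), "fp8e4nv requires compute capability 8.9+")
    class SiluMulQuantTests(unittest.TestCase):  # hypothetical wrapper class
        ...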
2025-05-07T20:32:14.1723780Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:14.1724197Z self=,
2025-05-07T20:32:14.1724617Z T=4096,
2025-05-07T20:32:14.1724815Z D=5120,
2025-05-07T20:32:14.1725010Z scale_ub=None,
2025-05-07T20:32:14.1725234Z contiguous=False,
2025-05-07T20:32:14.1725471Z compiled=False,
2025-05-07T20:32:14.1725684Z )
[test source identical to the listing above; elided]
2025-05-07T20:32:14.1738624Z > y_fp8, y_scale = fn()
2025-05-07T20:32:14.1738898Z moe/activation_test.py:117:
[traceback identical to the one above; elided]
2025-05-07T20:32:14.1752762Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.1753124Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:14.1753389Z E ^
2025-05-07T20:32:14.1753865Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.1754762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
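The Hypothesis loop is not needed to trigger the error; any call that reaches one of the two Triton kernels reproduces it. A standalone repro sketch, using the import path shown in the tracebacks above (the tensor shape is arbitrary, and passing None for the scale upper bound mirrors what the test does when scale_ub is None):

    import torch
    from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row

    y = torch.randn(16, 5120, device="cuda", dtype=torch.float32)
    # On SM < 8.9 this raises triton.compiler.errors.CompilationError wrapping
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    y_fp8, y_scale = triton_quantize_fp8_row(y, None)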
2025-05-07T20:32:14.1755394Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:14.1756118Z self=,
2025-05-07T20:32:14.1756532Z T=4096,
2025-05-07T20:32:14.1756728Z D=7168,
2025-05-07T20:32:14.1756922Z scale_ub=None,
2025-05-07T20:32:14.1757148Z contiguous=False,
2025-05-07T20:32:14.1757382Z compiled=False,
2025-05-07T20:32:14.1757593Z )
[test source identical to the listing above; elided]
2025-05-07T20:32:14.1770678Z > y_fp8, y_scale = fn()
2025-05-07T20:32:14.1770954Z moe/activation_test.py:117:
[traceback identical to the one above; elided]
2025-05-07T20:32:14.1784906Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.1785272Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:14.1785543Z E ^
2025-05-07T20:32:14.1786021Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.1786498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.1786610Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:14.1786846Z self=,
2025-05-07T20:32:14.1786926Z T=128,
2025-05-07T20:32:14.1787005Z D=7168,
2025-05-07T20:32:14.1787105Z scale_ub=None,
2025-05-07T20:32:14.1787195Z contiguous=False,
2025-05-07T20:32:14.1787288Z compiled=True,
2025-05-07T20:32:14.1787361Z )
[test source identical to the listing above; elided]
2025-05-07T20:32:14.1793432Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:14.1793582Z moe/activation_test.py:126:
[traceback identical to the one above; elided]
2025-05-07T20:32:14.1802881Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.1802989Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:14.1803073Z E ^
2025-05-07T20:32:14.1803583Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.1804027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.1804148Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:14.1804451Z self=,
2025-05-07T20:32:14.1804541Z T=128,
2025-05-07T20:32:14.1804619Z D=7168,
2025-05-07T20:32:14.1804704Z scale_ub=None,
2025-05-07T20:32:14.1804862Z contiguous=False,
2025-05-07T20:32:14.1804948Z compiled=False,
2025-05-07T20:32:14.1805030Z )
[test source identical to the listing above; elided]
2025-05-07T20:32:14.1810055Z > y_fp8, y_scale = fn()
2025-05-07T20:32:14.1810161Z moe/activation_test.py:117:
[traceback identical to the one above; elided]
2025-05-07T20:32:14.1816259Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.1816362Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:14.1816442Z E ^
2025-05-07T20:32:14.1816813Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.1817364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
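Each verbose retry above re-runs the same randomized search. To replay one failing parameter set deterministically while debugging, Hypothesis' standard @example decorator can pin it ahead of the random draws (a sketch; the strategies are copied from the test shown above, the pinned values are taken from this log, and the method body is elided):

    from hypothesis import example, given, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @example(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled):
        ...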
2025-05-07T20:32:14.1817490Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:14.1817722Z self=,
2025-05-07T20:32:14.1817803Z T=4096,
2025-05-07T20:32:14.1817891Z D=5120,
2025-05-07T20:32:14.1817979Z scale_ub=1200.0,
2025-05-07T20:32:14.1818178Z contiguous=True,
2025-05-07T20:32:14.1818277Z compiled=False,
2025-05-07T20:32:14.1818353Z )
[test source identical to the listing above; elided]
2025-05-07T20:32:14.1823404Z > y_fp8, y_scale = fn()
2025-05-07T20:32:14.1823508Z moe/activation_test.py:117:
[traceback identical to the one above; elided]
2025-05-07T20:32:14.1829624Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.1829736Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:14.1829815Z E ^
2025-05-07T20:32:14.1830221Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.1830655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.1830772Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:14.1831003Z self=,
2025-05-07T20:32:14.1831092Z T=1,
2025-05-07T20:32:14.1831172Z D=5120,
2025-05-07T20:32:14.1831263Z scale_ub=None,
2025-05-07T20:32:14.1831352Z contiguous=True,
2025-05-07T20:32:14.1831438Z compiled=True,
2025-05-07T20:32:14.1831519Z )
[test source identical to the listing above; elided]
2025-05-07T20:32:14.1843741Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:14.1843856Z moe/activation_test.py:126:
[traceback identical to the one above; elided]
2025-05-07T20:32:14.1853032Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.1853182Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:14.1853261Z E ^
2025-05-07T20:32:14.1853636Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.1854069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.1854197Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:14.1854428Z self=,
2025-05-07T20:32:14.1854510Z T=2048,
2025-05-07T20:32:14.1854600Z D=5120,
2025-05-07T20:32:14.1854688Z scale_ub=None,
2025-05-07T20:32:14.1854778Z contiguous=True,
2025-05-07T20:32:14.1854876Z compiled=True,
2025-05-07T20:32:14.1854954Z )
[test source identical to the listing above; elided]
2025-05-07T20:32:14.1861720Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:14.1861835Z moe/activation_test.py:126:
[traceback identical to the one above; elided]
2025-05-07T20:32:14.1871037Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.1871145Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:14.1871230Z E ^
2025-05-07T20:32:14.1871596Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.1872028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
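One consistent detail across these retries: examples with compiled=False fail inside fn() at moe/activation_test.py:117 (the _fbgemm_silu_mul_quant kernel), while compiled=True examples get past fn() and fail in ref_fn() at line 126 (_kernel_quantize_fp8_row), suggesting torch.compile generates its own quantization code rather than invoking the raw Triton kernel. The reference math itself needs no Triton at all; a plain-PyTorch version of what ref_fn computes might look like the following (illustrative only, assuming torch.float8_e4m3fn is available in this PyTorch build; this is not FBGEMM's API):

    import torch

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
        # SiLU(x0) * x1, computed in fp32 as in ref_fn
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=float(scale_ub))
        # Per-row dequantization scale, so that y ~= y_fp8.float() * scale[:, None]
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale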
2025-05-07T20:32:14.1860334Z op = torch.compile(op) 2025-05-07T20:32:14.1860443Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1860524Z 2025-05-07T20:32:14.1860619Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.1860745Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.1860825Z 2025-05-07T20:32:14.1860974Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1861086Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.1861189Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.1861382Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.1861538Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1861614Z 2025-05-07T20:32:14.1861720Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.1861725Z 2025-05-07T20:32:14.1861835Z moe/activation_test.py:126: 2025-05-07T20:32:14.1861975Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1862092Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.1862233Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1862843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.1862981Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.1863351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1863584Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1863970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.1864233Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1864651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.1864909Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1865294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.1865554Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.1865908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.1865997Z fn() 2025-05-07T20:32:14.1866409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.1866495Z self.fn.run( 2025-05-07T20:32:14.1866849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1866956Z kernel = self.compile( 2025-05-07T20:32:14.1867347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1867534Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1867667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1867671Z 2025-05-07T20:32:14.1867901Z self = 2025-05-07T20:32:14.1868703Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
2025-05-07T20:32:14.1869224Z codegen_fns = {'convert_custom_types': <function ... at 0x...>, 'min_dot_size': <function ... at 0x7fac9d8c2d40>}
2025-05-07T20:32:14.1870043Z module_map = {'triton.language.extra.libdevice': <module 'triton.language.extra.libdevice' from '...'>}
2025-05-07T20:32:14.1870245Z context = <...>
2025-05-07T20:32:14.1870250Z
2025-05-07T20:32:14.1870428Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:14.1870710Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:14.1870822Z module_map=module_map)
2025-05-07T20:32:14.1871037Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.1871145Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:14.1871230Z E ^
2025-05-07T20:32:14.1871596Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.1871603Z
2025-05-07T20:32:14.1872028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.1872033Z
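Every example in this run fails at Triton compile time, before any kernel executes: both _fbgemm_silu_mul_quant and the _kernel_quantize_fp8_row kernel behind triton_quantize_fp8_row emit the fp8e4nv (FP8 E4M3) dtype, which Triton only supports on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). The g5.4xlarge runner's A10G reports compute capability 8.6, so every sampled combination of T, D, scale_ub, contiguous, and compiled fails identically. A minimal sketch of a hardware guard for such tests, assuming unittest-style test classes as shown in the log (supports_fp8e4nv is a hypothetical helper, not FBGEMM's actual API):

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (FP8 E4M3) codegen needs an NVIDIA GPU with
        # compute capability >= 8.9 (e.g. L4/L40S at 8.9, H100 at 9.0);
        # the A10G on this runner reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 not supported on this GPU")
    class SiluMulQuantTest(unittest.TestCase):
        ...

For reference, the row-wise quantization that the test's ref_fn delegates to triton_quantize_fp8_row can be approximated in plain PyTorch. This is an illustrative sketch consistent with the test's dequantization (y_fp8.to(torch.float32) * y_scale[:, None]), assuming E4M3 with a saturating clamp; the actual Triton kernel's details may differ:

    from typing import Optional

    import torch


    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ):
        # One dequantization scale per row: y ~= y_fp8.float() * y_scale[:, None].
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            # Cap the per-row maximum, as the scale_ub=1200.0 examples do.
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / fp8_max
        y_fp8 = (y / y_scale[:, None]).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

Each Hypothesis example that follows fails with this same CompilationError; only the sampled parameters and the kernel that compiles first differ: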
2025-05-07T20:32:14.1877484Z op = silu_mul_quant 2025-05-07T20:32:14.1877572Z if compiled: 2025-05-07T20:32:14.1877685Z op = torch.compile(op) 2025-05-07T20:32:14.1877798Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1877873Z 2025-05-07T20:32:14.1877974Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.1878107Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.1878182Z 2025-05-07T20:32:14.1878331Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1878507Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.1878619Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.1878749Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.1878896Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1878978Z 2025-05-07T20:32:14.1879087Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.1879092Z 2025-05-07T20:32:14.1879196Z moe/activation_test.py:126: 2025-05-07T20:32:14.1879332Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1879442Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.1879590Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1880169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.1880276Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.1880658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1880886Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1881263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.1881536Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1881946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.1882213Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1882725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.1882903Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.1883266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.1883347Z fn() 2025-05-07T20:32:14.1883765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.1883851Z self.fn.run( 2025-05-07T20:32:14.1884203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1884310Z kernel = self.compile( 2025-05-07T20:32:14.1884701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1884881Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1885025Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1885029Z 2025-05-07T20:32:14.1885242Z self = 2025-05-07T20:32:14.1886054Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1886576Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9d8c2e60>} 2025-05-07T20:32:14.1887391Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.1887589Z context = 2025-05-07T20:32:14.1887600Z 2025-05-07T20:32:14.1887771Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.1888053Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.1888208Z module_map=module_map) 2025-05-07T20:32:14.1888377Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.1888490Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.1888569Z E ^ 2025-05-07T20:32:14.1888944Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.1888949Z 2025-05-07T20:32:14.1889376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.1889380Z 2025-05-07T20:32:14.1889489Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.1889729Z self=, 2025-05-07T20:32:14.1889809Z T=4096, 2025-05-07T20:32:14.1889894Z D=5120, 2025-05-07T20:32:14.1889979Z scale_ub=None, 2025-05-07T20:32:14.1890071Z contiguous=True, 2025-05-07T20:32:14.1890163Z compiled=True, 2025-05-07T20:32:14.1890239Z ) 2025-05-07T20:32:14.1890465Z self = 2025-05-07T20:32:14.1890648Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.1890653Z 2025-05-07T20:32:14.1890737Z @given( 2025-05-07T20:32:14.1890863Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.1890974Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.1891095Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.1891224Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.1891343Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.1891420Z ) 2025-05-07T20:32:14.1891778Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.1891879Z def test_silu_mul_quant( 2025-05-07T20:32:14.1891960Z self, 2025-05-07T20:32:14.1892048Z T: int, 2025-05-07T20:32:14.1892126Z D: int, 2025-05-07T20:32:14.1892230Z scale_ub: Optional[float], 2025-05-07T20:32:14.1892329Z contiguous: bool, 2025-05-07T20:32:14.1892418Z compiled: bool, 2025-05-07T20:32:14.1892500Z ) -> None: 2025-05-07T20:32:14.1892606Z torch.manual_seed(2025) 2025-05-07T20:32:14.1892686Z 2025-05-07T20:32:14.1892900Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.1892992Z 2025-05-07T20:32:14.1893089Z x_sign = torch.sign(x) 2025-05-07T20:32:14.1893225Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.1893318Z x = x_sign * x_clamp 2025-05-07T20:32:14.1893399Z x0 = x[:, :D] 2025-05-07T20:32:14.1893486Z x1 = x[:, D:] 2025-05-07T20:32:14.1893566Z 2025-05-07T20:32:14.1893654Z if contiguous: 2025-05-07T20:32:14.1893754Z x0 = x0.contiguous() 2025-05-07T20:32:14.1893848Z x1 = x1.contiguous() 2025-05-07T20:32:14.1893923Z 2025-05-07T20:32:14.1894025Z if scale_ub is not None: 2025-05-07T20:32:14.1894135Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.1894275Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.1894359Z ) 2025-05-07T20:32:14.1894509Z else: 2025-05-07T20:32:14.1894617Z scale_ub_tensor 
= None 2025-05-07T20:32:14.1894693Z 2025-05-07T20:32:14.1894839Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1894975Z op = silu_mul_quant 2025-05-07T20:32:14.1895101Z if compiled: 2025-05-07T20:32:14.1895241Z op = torch.compile(op) 2025-05-07T20:32:14.1895375Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1895449Z 2025-05-07T20:32:14.1895550Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.1895683Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.1895812Z 2025-05-07T20:32:14.1895954Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1896065Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.1896166Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.1896297Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.1896442Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1896520Z 2025-05-07T20:32:14.1896628Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.1896633Z 2025-05-07T20:32:14.1896734Z moe/activation_test.py:126: 2025-05-07T20:32:14.1896863Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1896979Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.1897122Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1897704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.1897811Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.1898260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1898495Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1898876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.1899138Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1899554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.1899895Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1900295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.1900470Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.1900822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.1900906Z fn() 2025-05-07T20:32:14.1901317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.1901408Z self.fn.run( 2025-05-07T20:32:14.1901756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1901851Z kernel = self.compile( 2025-05-07T20:32:14.1902245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1902430Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1902573Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1902581Z 2025-05-07T20:32:14.1902794Z self = 2025-05-07T20:32:14.1903710Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1904293Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9d456320>} 2025-05-07T20:32:14.1905144Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.1905360Z context = 2025-05-07T20:32:14.1905365Z 2025-05-07T20:32:14.1905535Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.1905954Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.1906107Z module_map=module_map) 2025-05-07T20:32:14.1906319Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.1906480Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.1906585Z E ^ 2025-05-07T20:32:14.1907104Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.1907121Z 2025-05-07T20:32:14.1907555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.1907560Z 2025-05-07T20:32:14.1907669Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.1907905Z self=, 2025-05-07T20:32:14.1907983Z T=16384, 2025-05-07T20:32:14.1908066Z D=5120, 2025-05-07T20:32:14.1908154Z scale_ub=None, 2025-05-07T20:32:14.1908240Z contiguous=True, 2025-05-07T20:32:14.1908324Z compiled=True, 2025-05-07T20:32:14.1908405Z ) 2025-05-07T20:32:14.1908629Z self = 2025-05-07T20:32:14.1908817Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.1908821Z 2025-05-07T20:32:14.1908903Z @given( 2025-05-07T20:32:14.1909026Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.1909135Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.1909252Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.1909370Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.1909595Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.1909670Z ) 2025-05-07T20:32:14.1909923Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.1910026Z def test_silu_mul_quant( 2025-05-07T20:32:14.1910102Z self, 2025-05-07T20:32:14.1910184Z T: int, 2025-05-07T20:32:14.1910260Z D: int, 2025-05-07T20:32:14.1910360Z scale_ub: Optional[float], 2025-05-07T20:32:14.1910455Z contiguous: bool, 2025-05-07T20:32:14.1910544Z compiled: bool, 2025-05-07T20:32:14.1910623Z ) -> None: 2025-05-07T20:32:14.1910723Z torch.manual_seed(2025) 2025-05-07T20:32:14.1910798Z 2025-05-07T20:32:14.1910969Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.1911048Z 2025-05-07T20:32:14.1911141Z x_sign = torch.sign(x) 2025-05-07T20:32:14.1911271Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.1911370Z x = x_sign * x_clamp 2025-05-07T20:32:14.1911450Z x0 = x[:, :D] 2025-05-07T20:32:14.1911537Z x1 = x[:, D:] 2025-05-07T20:32:14.1911609Z 2025-05-07T20:32:14.1911696Z if contiguous: 2025-05-07T20:32:14.1911792Z x0 = x0.contiguous() 2025-05-07T20:32:14.1911884Z x1 = x1.contiguous() 2025-05-07T20:32:14.1911956Z 2025-05-07T20:32:14.1912053Z if scale_ub is not None: 2025-05-07T20:32:14.1912164Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.1912315Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:14.1912491Z ) 2025-05-07T20:32:14.1912583Z else: 2025-05-07T20:32:14.1912678Z scale_ub_tensor = None 2025-05-07T20:32:14.1912756Z 2025-05-07T20:32:14.1912888Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1912981Z op = silu_mul_quant 2025-05-07T20:32:14.1913074Z if compiled: 2025-05-07T20:32:14.1913181Z op = torch.compile(op) 2025-05-07T20:32:14.1913294Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1913367Z 2025-05-07T20:32:14.1913460Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.1913634Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.1913708Z 2025-05-07T20:32:14.1913848Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1913957Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.1914057Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.1914183Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.1914332Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1914407Z 2025-05-07T20:32:14.1914519Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.1914524Z 2025-05-07T20:32:14.1914624Z moe/activation_test.py:126: 2025-05-07T20:32:14.1914752Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1914871Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.1915012Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1915593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.1915706Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.1916073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1916310Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1916686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.1916947Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1917442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.1917701Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1918097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.1918270Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.1918622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.1918708Z fn() 2025-05-07T20:32:14.1919119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.1919203Z self.fn.run( 2025-05-07T20:32:14.1919557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1919653Z kernel = self.compile( 2025-05-07T20:32:14.1920063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1920242Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1920374Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:14.1920379Z 2025-05-07T20:32:14.1920597Z self = 2025-05-07T20:32:14.1921398Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1921965Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9d8c25f0>} 2025-05-07T20:32:14.1922791Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.1922987Z context = 2025-05-07T20:32:14.1923032Z 2025-05-07T20:32:14.1923207Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.1923479Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.1923595Z module_map=module_map) 2025-05-07T20:32:14.1923761Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.1923865Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.1923949Z E ^ 2025-05-07T20:32:14.1924314Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.1924318Z 2025-05-07T20:32:14.1924754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.1924759Z 2025-05-07T20:32:14.1924865Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.1925097Z self=, 2025-05-07T20:32:14.1925183Z T=1, 2025-05-07T20:32:14.1925261Z D=5120, 2025-05-07T20:32:14.1925345Z scale_ub=1200.0, 2025-05-07T20:32:14.1925437Z contiguous=True, 2025-05-07T20:32:14.1925521Z compiled=True, 2025-05-07T20:32:14.1925594Z ) 2025-05-07T20:32:14.1925826Z self = 2025-05-07T20:32:14.1925997Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:14.1926002Z 2025-05-07T20:32:14.1926086Z @given( 2025-05-07T20:32:14.1926207Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.1926310Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.1926439Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.1926636Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.1926754Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.1926837Z ) 2025-05-07T20:32:14.1927090Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.1927185Z def test_silu_mul_quant( 2025-05-07T20:32:14.1927267Z self, 2025-05-07T20:32:14.1927343Z T: int, 2025-05-07T20:32:14.1927423Z D: int, 2025-05-07T20:32:14.1927524Z scale_ub: Optional[float], 2025-05-07T20:32:14.1927617Z contiguous: bool, 2025-05-07T20:32:14.1927709Z compiled: bool, 2025-05-07T20:32:14.1927788Z ) -> None: 2025-05-07T20:32:14.1927884Z torch.manual_seed(2025) 2025-05-07T20:32:14.1927962Z 2025-05-07T20:32:14.1928133Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.1928207Z 2025-05-07T20:32:14.1928310Z x_sign = torch.sign(x) 2025-05-07T20:32:14.1928442Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.1928530Z x = x_sign * x_clamp 2025-05-07T20:32:14.1928616Z x0 = x[:, :D] 2025-05-07T20:32:14.1928698Z x1 = x[:, D:] 2025-05-07T20:32:14.1928770Z 2025-05-07T20:32:14.1928864Z if contiguous: 2025-05-07T20:32:14.1928958Z x0 = x0.contiguous() 2025-05-07T20:32:14.1929052Z x1 = x1.contiguous() 2025-05-07T20:32:14.1929124Z 2025-05-07T20:32:14.1929216Z if scale_ub is not None: 2025-05-07T20:32:14.1929375Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:32:14.1929512Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.1929588Z ) 2025-05-07T20:32:14.1929669Z else: 2025-05-07T20:32:14.1929764Z scale_ub_tensor = None 2025-05-07T20:32:14.1929837Z 2025-05-07T20:32:14.1929977Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1930069Z op = silu_mul_quant 2025-05-07T20:32:14.1930160Z if compiled: 2025-05-07T20:32:14.1930268Z op = torch.compile(op) 2025-05-07T20:32:14.1930375Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1930499Z 2025-05-07T20:32:14.1930590Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.1930595Z 2025-05-07T20:32:14.1930692Z moe/activation_test.py:117: 2025-05-07T20:32:14.1930829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1930932Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.1931037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1931427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.1931521Z return fn(*args, **kwargs) 2025-05-07T20:32:14.1932037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.1932142Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.1932537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1932801Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1933151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1933246Z kernel = self.compile( 2025-05-07T20:32:14.1933645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1933826Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1933958Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1933963Z 2025-05-07T20:32:14.1934173Z self = 2025-05-07T20:32:14.1935048Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1935580Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9cff05e0>} 2025-05-07T20:32:14.1936343Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.1936547Z context = 2025-05-07T20:32:14.1936552Z 2025-05-07T20:32:14.1936721Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.1936997Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.1937112Z module_map=module_map) 2025-05-07T20:32:14.1937277Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.1937388Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.1937468Z E ^ 2025-05-07T20:32:14.1937830Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.1937835Z 2025-05-07T20:32:14.1938383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.1938440Z 2025-05-07T20:32:14.1938550Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.1938787Z self=, 2025-05-07T20:32:14.1938866Z T=1, 2025-05-07T20:32:14.1938945Z D=5120, 2025-05-07T20:32:14.1939035Z scale_ub=None, 2025-05-07T20:32:14.1939125Z contiguous=False, 2025-05-07T20:32:14.1939210Z compiled=True, 2025-05-07T20:32:14.1939297Z ) 2025-05-07T20:32:14.1939521Z self = 2025-05-07T20:32:14.1939689Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:14.1939742Z 2025-05-07T20:32:14.1939819Z @given( 2025-05-07T20:32:14.1939941Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.1940048Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.1940167Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.1940295Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.1940417Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.1940492Z ) 2025-05-07T20:32:14.1940743Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.1940843Z def test_silu_mul_quant( 2025-05-07T20:32:14.1940920Z self, 2025-05-07T20:32:14.1940997Z T: int, 2025-05-07T20:32:14.1941084Z D: int, 2025-05-07T20:32:14.1941184Z scale_ub: Optional[float], 2025-05-07T20:32:14.1941282Z contiguous: bool, 2025-05-07T20:32:14.1941370Z compiled: bool, 2025-05-07T20:32:14.1941452Z ) -> None: 2025-05-07T20:32:14.1941560Z torch.manual_seed(2025) 2025-05-07T20:32:14.1941633Z 2025-05-07T20:32:14.1941806Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.1941887Z 2025-05-07T20:32:14.1941979Z x_sign = torch.sign(x) 2025-05-07T20:32:14.1942109Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.1942207Z x = x_sign * x_clamp 2025-05-07T20:32:14.1942288Z x0 = x[:, :D] 2025-05-07T20:32:14.1942374Z x1 = x[:, D:] 2025-05-07T20:32:14.1942471Z 2025-05-07T20:32:14.1942559Z if contiguous: 2025-05-07T20:32:14.1942669Z x0 = x0.contiguous() 2025-05-07T20:32:14.1942766Z x1 = x1.contiguous() 2025-05-07T20:32:14.1942837Z 2025-05-07T20:32:14.1943039Z if scale_ub is not None: 2025-05-07T20:32:14.1943147Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.1943284Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.1943369Z ) 2025-05-07T20:32:14.1943445Z else: 2025-05-07T20:32:14.1943539Z scale_ub_tensor = None 2025-05-07T20:32:14.1943618Z 2025-05-07T20:32:14.1943749Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1943839Z op = silu_mul_quant 2025-05-07T20:32:14.1943933Z if compiled: 2025-05-07T20:32:14.1944034Z op = torch.compile(op) 2025-05-07T20:32:14.1944141Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1944220Z 2025-05-07T20:32:14.1944311Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.1944440Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.1944514Z 2025-05-07T20:32:14.1944657Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1944766Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.1944868Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.1944995Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.1945144Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1945217Z 2025-05-07T20:32:14.1945319Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:14.1945324Z 2025-05-07T20:32:14.1945429Z moe/activation_test.py:126: 2025-05-07T20:32:14.1945604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1945717Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.1945854Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1946428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.1946542Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.1946908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1947175Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1947557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.1947818Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1948235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.1948492Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1948873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.1949054Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.1949403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.1949494Z fn() 2025-05-07T20:32:14.1949903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.1949985Z self.fn.run( 2025-05-07T20:32:14.1950337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1950434Z kernel = self.compile( 2025-05-07T20:32:14.1950822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1951007Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1951136Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1951140Z 2025-05-07T20:32:14.1951436Z self = 2025-05-07T20:32:14.1952237Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1952800Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fac9cff1090>} 2025-05-07T20:32:14.1953586Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.1953784Z context = 2025-05-07T20:32:14.1953789Z 2025-05-07T20:32:14.1953963Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.1954239Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.1954355Z module_map=module_map) 2025-05-07T20:32:14.1954522Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.1954627Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.1954710Z E ^ 2025-05-07T20:32:14.1955073Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.1955120Z 2025-05-07T20:32:14.1955961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.1955981Z 2025-05-07T20:32:14.1956097Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.1956329Z self=, 2025-05-07T20:32:14.1956415Z T=1, 2025-05-07T20:32:14.1956495Z D=5120, 2025-05-07T20:32:14.1956585Z scale_ub=None, 2025-05-07T20:32:14.1956677Z contiguous=True, 2025-05-07T20:32:14.1956763Z compiled=False, 2025-05-07T20:32:14.1956836Z ) 2025-05-07T20:32:14.1957230Z self = 2025-05-07T20:32:14.1957398Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:14.1957402Z 2025-05-07T20:32:14.1957480Z @given( 2025-05-07T20:32:14.1957610Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.1957714Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.1957839Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.1957962Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.1958079Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.1958161Z ) 2025-05-07T20:32:14.1958414Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.1958507Z def test_silu_mul_quant( 2025-05-07T20:32:14.1958597Z self, 2025-05-07T20:32:14.1958676Z T: int, 2025-05-07T20:32:14.1958752Z D: int, 2025-05-07T20:32:14.1958858Z scale_ub: Optional[float], 2025-05-07T20:32:14.1958952Z contiguous: bool, 2025-05-07T20:32:14.1959047Z compiled: bool, 2025-05-07T20:32:14.1959125Z ) -> None: 2025-05-07T20:32:14.1959222Z torch.manual_seed(2025) 2025-05-07T20:32:14.1959299Z 2025-05-07T20:32:14.1959473Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.1959547Z 2025-05-07T20:32:14.1959646Z x_sign = torch.sign(x) 2025-05-07T20:32:14.1959774Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.1959865Z x = x_sign * x_clamp 2025-05-07T20:32:14.1959950Z x0 = x[:, :D] 2025-05-07T20:32:14.1960029Z x1 = x[:, D:] 2025-05-07T20:32:14.1960100Z 2025-05-07T20:32:14.1960191Z if contiguous: 2025-05-07T20:32:14.1960283Z x0 = x0.contiguous() 2025-05-07T20:32:14.1960583Z x1 = x1.contiguous() 2025-05-07T20:32:14.1960665Z 2025-05-07T20:32:14.1960758Z if scale_ub is not None: 2025-05-07T20:32:14.1960875Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.1961012Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.1961087Z ) 2025-05-07T20:32:14.1961169Z else: 2025-05-07T20:32:14.1961264Z scale_ub_tensor = None 2025-05-07T20:32:14.1961337Z 2025-05-07T20:32:14.1961480Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1961571Z op = silu_mul_quant 2025-05-07T20:32:14.1961656Z if compiled: 2025-05-07T20:32:14.1961764Z 
op = torch.compile(op) 2025-05-07T20:32:14.1961872Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1961944Z 2025-05-07T20:32:14.1962041Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.1962045Z 2025-05-07T20:32:14.1962150Z moe/activation_test.py:117: 2025-05-07T20:32:14.1962287Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1962389Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.1962493Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1963016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.1963116Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.1963484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1963781Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1964131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1964232Z kernel = self.compile( 2025-05-07T20:32:14.1964629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1964808Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1964982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1964987Z 2025-05-07T20:32:14.1965198Z self = 2025-05-07T20:32:14.1966002Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1966520Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9cff3880>} 2025-05-07T20:32:14.1967292Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.1967495Z context = 2025-05-07T20:32:14.1967502Z 2025-05-07T20:32:14.1967670Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.1967952Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.1968061Z module_map=module_map) 2025-05-07T20:32:14.1968227Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.1968334Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.1968408Z E ^ 2025-05-07T20:32:14.1968788Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.1968792Z 2025-05-07T20:32:14.1969294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.1969299Z 2025-05-07T20:32:14.1969408Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.1969643Z self=, 2025-05-07T20:32:14.1975410Z T=128, 2025-05-07T20:32:14.1975520Z D=5120, 2025-05-07T20:32:14.1975613Z scale_ub=None, 2025-05-07T20:32:14.1975702Z contiguous=False, 2025-05-07T20:32:14.1975795Z compiled=True, 2025-05-07T20:32:14.1975870Z ) 2025-05-07T20:32:14.1976104Z self = 2025-05-07T20:32:14.1976293Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:14.1976299Z 2025-05-07T20:32:14.1976377Z @given( 2025-05-07T20:32:14.1976511Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.1976613Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.1976732Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.1976864Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.1976981Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.1977055Z ) 2025-05-07T20:32:14.1977318Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.1977416Z def test_silu_mul_quant( 2025-05-07T20:32:14.1977494Z self, 2025-05-07T20:32:14.1977579Z T: int, 2025-05-07T20:32:14.1977654Z D: int, 2025-05-07T20:32:14.1977755Z scale_ub: Optional[float], 2025-05-07T20:32:14.1977933Z contiguous: bool, 2025-05-07T20:32:14.1978104Z compiled: bool, 2025-05-07T20:32:14.1978224Z ) -> None: 2025-05-07T20:32:14.1978325Z torch.manual_seed(2025) 2025-05-07T20:32:14.1978400Z 2025-05-07T20:32:14.1978583Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.1978659Z 2025-05-07T20:32:14.1978754Z x_sign = torch.sign(x) 2025-05-07T20:32:14.1978897Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.1978988Z x = x_sign * x_clamp 2025-05-07T20:32:14.1979069Z x0 = x[:, :D] 2025-05-07T20:32:14.1979214Z x1 = x[:, D:] 2025-05-07T20:32:14.1979288Z 2025-05-07T20:32:14.1979376Z if contiguous: 2025-05-07T20:32:14.1979479Z x0 = x0.contiguous() 2025-05-07T20:32:14.1979572Z x1 = x1.contiguous() 2025-05-07T20:32:14.1979654Z 2025-05-07T20:32:14.1979747Z if scale_ub is not None: 2025-05-07T20:32:14.1979858Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.1980011Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.1980087Z ) 2025-05-07T20:32:14.1980165Z else: 2025-05-07T20:32:14.1980271Z scale_ub_tensor = None 2025-05-07T20:32:14.1980344Z 2025-05-07T20:32:14.1980480Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1980582Z op = silu_mul_quant 2025-05-07T20:32:14.1980674Z if compiled: 2025-05-07T20:32:14.1980777Z op = torch.compile(op) 2025-05-07T20:32:14.1980894Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1980973Z 2025-05-07T20:32:14.1981070Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.1981082Z 2025-05-07T20:32:14.1981184Z moe/activation_test.py:117: 2025-05-07T20:32:14.1981320Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1981436Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.1981541Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1981925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.1982029Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.1982559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.1982682Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.1983180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1983416Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1983774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1983873Z kernel = self.compile( 2025-05-07T20:32:14.1984266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1984458Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1984588Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1984593Z 2025-05-07T20:32:14.1984813Z self = 2025-05-07T20:32:14.1985621Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1986146Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9cff3eb0>} 2025-05-07T20:32:14.1986922Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.1987162Z context = 2025-05-07T20:32:14.1987167Z 2025-05-07T20:32:14.1987345Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.1987618Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.1987735Z module_map=module_map) 2025-05-07T20:32:14.1987915Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.1988020Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.1988147Z E ^ 2025-05-07T20:32:14.1988513Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.1988518Z 2025-05-07T20:32:14.1988945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.1988952Z 2025-05-07T20:32:14.1989070Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.1989301Z self=, 2025-05-07T20:32:14.1989389Z T=128, 2025-05-07T20:32:14.1989467Z D=7168, 2025-05-07T20:32:14.1989552Z scale_ub=1200.0, 2025-05-07T20:32:14.1989649Z contiguous=False, 2025-05-07T20:32:14.1989734Z compiled=False, 2025-05-07T20:32:14.1989809Z ) 2025-05-07T20:32:14.1990045Z self = 2025-05-07T20:32:14.1990224Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:14.1990232Z 2025-05-07T20:32:14.1990311Z @given( 2025-05-07T20:32:14.1990441Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.1990543Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.1990668Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.1990793Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.1990909Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.1990989Z ) 2025-05-07T20:32:14.1991241Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.1991337Z def test_silu_mul_quant( 2025-05-07T20:32:14.1991422Z self, 2025-05-07T20:32:14.1991500Z T: int, 2025-05-07T20:32:14.1991577Z D: int, 2025-05-07T20:32:14.1991767Z scale_ub: Optional[float], 2025-05-07T20:32:14.1991860Z contiguous: bool, 2025-05-07T20:32:14.1991947Z compiled: bool, 2025-05-07T20:32:14.1992036Z ) -> None: 2025-05-07T20:32:14.1992133Z torch.manual_seed(2025) 2025-05-07T20:32:14.1992211Z 2025-05-07T20:32:14.1992385Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.1992461Z 2025-05-07T20:32:14.1992552Z x_sign = torch.sign(x) 2025-05-07T20:32:14.1992689Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.1992781Z x = x_sign * x_clamp 2025-05-07T20:32:14.1992863Z x0 = x[:, :D] 2025-05-07T20:32:14.1992951Z x1 = x[:, D:] 2025-05-07T20:32:14.1993024Z 2025-05-07T20:32:14.1993107Z if contiguous: 2025-05-07T20:32:14.1993205Z x0 = x0.contiguous() 2025-05-07T20:32:14.1993296Z x1 = x1.contiguous() 2025-05-07T20:32:14.1993369Z 2025-05-07T20:32:14.1993473Z if scale_ub is not None: 2025-05-07T20:32:14.1993579Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.1993722Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.1993802Z ) 2025-05-07T20:32:14.1993878Z else: 2025-05-07T20:32:14.1993978Z scale_ub_tensor = None 2025-05-07T20:32:14.1994050Z 2025-05-07T20:32:14.1994182Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1994281Z op = silu_mul_quant 2025-05-07T20:32:14.1994413Z if compiled: 2025-05-07T20:32:14.1994516Z op = torch.compile(op) 2025-05-07T20:32:14.1994631Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1994703Z 2025-05-07T20:32:14.1994795Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.1994800Z 2025-05-07T20:32:14.1994908Z moe/activation_test.py:117: 2025-05-07T20:32:14.1995040Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1995157Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.1995260Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1995774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.1995931Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.1996300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1996528Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1996887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1996986Z kernel = self.compile( 2025-05-07T20:32:14.1997385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1997569Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1997697Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1997702Z 2025-05-07T20:32:14.1997924Z self = 2025-05-07T20:32:14.1998726Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1999254Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9cff0d30>} 2025-05-07T20:32:14.2000024Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2000316Z context = 2025-05-07T20:32:14.2000322Z 2025-05-07T20:32:14.2000495Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2000773Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2000892Z module_map=module_map) 2025-05-07T20:32:14.2001059Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2001163Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2001254Z E ^ 2025-05-07T20:32:14.2001621Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2001626Z 2025-05-07T20:32:14.2002061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2002065Z 2025-05-07T20:32:14.2002174Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2002434Z self=, 2025-05-07T20:32:14.2002527Z T=128, 2025-05-07T20:32:14.2002626Z D=5120, 2025-05-07T20:32:14.2002717Z scale_ub=None, 2025-05-07T20:32:14.2002816Z contiguous=False, 2025-05-07T20:32:14.2002907Z compiled=False, 2025-05-07T20:32:14.2002983Z ) 2025-05-07T20:32:14.2003214Z self = 2025-05-07T20:32:14.2003391Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:14.2003440Z 2025-05-07T20:32:14.2003530Z @given( 2025-05-07T20:32:14.2003655Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2003762Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2003889Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2004012Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2004131Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2004224Z ) 2025-05-07T20:32:14.2004478Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2004630Z def test_silu_mul_quant( 2025-05-07T20:32:14.2004709Z self, 2025-05-07T20:32:14.2004789Z T: int, 2025-05-07T20:32:14.2004874Z D: int, 2025-05-07T20:32:14.2004978Z scale_ub: Optional[float], 2025-05-07T20:32:14.2005071Z contiguous: bool, 2025-05-07T20:32:14.2005164Z compiled: bool, 2025-05-07T20:32:14.2005247Z ) -> None: 2025-05-07T20:32:14.2005383Z torch.manual_seed(2025) 2025-05-07T20:32:14.2005497Z 2025-05-07T20:32:14.2005741Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2005838Z 2025-05-07T20:32:14.2005940Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2006069Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2006168Z x = x_sign * x_clamp 2025-05-07T20:32:14.2006257Z x0 = x[:, :D] 2025-05-07T20:32:14.2006341Z x1 = x[:, D:] 2025-05-07T20:32:14.2006423Z 2025-05-07T20:32:14.2006511Z if contiguous: 2025-05-07T20:32:14.2006615Z x0 = x0.contiguous() 2025-05-07T20:32:14.2006716Z x1 = x1.contiguous() 2025-05-07T20:32:14.2006791Z 2025-05-07T20:32:14.2006886Z if scale_ub is not None: 2025-05-07T20:32:14.2007001Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2007141Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2007224Z ) 2025-05-07T20:32:14.2007311Z else: 2025-05-07T20:32:14.2007426Z scale_ub_tensor = None 2025-05-07T20:32:14.2007535Z 2025-05-07T20:32:14.2007709Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2007805Z op = silu_mul_quant 2025-05-07T20:32:14.2007900Z if compiled: 2025-05-07T20:32:14.2008002Z op = torch.compile(op) 2025-05-07T20:32:14.2008212Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2008294Z 2025-05-07T20:32:14.2008389Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2008393Z 2025-05-07T20:32:14.2008497Z moe/activation_test.py:117: 2025-05-07T20:32:14.2008691Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2008832Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2008937Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2009538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2009650Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2010029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2010258Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2010616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2010722Z kernel = self.compile( 2025-05-07T20:32:14.2011115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2011305Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2011433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2011438Z 2025-05-07T20:32:14.2011652Z self = 2025-05-07T20:32:14.2012519Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2013097Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9c9a6cb0>} 2025-05-07T20:32:14.2013871Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2014144Z context = 2025-05-07T20:32:14.2014149Z 2025-05-07T20:32:14.2014324Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2014609Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2014720Z module_map=module_map) 2025-05-07T20:32:14.2014894Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2014995Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2015076Z E ^ 2025-05-07T20:32:14.2015452Z E ValueError("type fp8e4nv not supported in this architecture. 
The next Hypothesis examples fail identically: fn() reaches the Triton compile step for _fbgemm_silu_mul_quant (routed through torch/_dynamo/eval_frame.py:678 when compiled=True) and raises the same CompilationError, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). Only the sampled parameters differ:

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
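The common denominator is the GPU, not the parameters: fp8e4nv is Triton's e4m3 float8 type, which Triton only enables on devices with compute capability 8.9 (Ada) or newer, while this job's linux.g5.4xlarge runner carries an NVIDIA A10G at sm_86 (the error's supported list, fp8e4b15 and fp8e5, is what Triton offers there instead). A sketch of a guard the test module could use to skip these cases on such hardware; the helper name and the 8.9 threshold are assumptions, not code from this log:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (e4m3) Triton kernels need sm_89+ (Ada/Hopper); A10G reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the failing test:
    # @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires sm_89+")
    # def test_silu_mul_quant(self, ...) -> None: ...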
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self =
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

This example gets further (decorators and setup identical to the first example above): fn() returns, and the failure moves into the reference computation instead:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
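Here the reference path hits the same unsupported-dtype error during autotuning of _kernel_quantize_fp8_row, which confirms the environment is the limitation: even the comparison kernel cannot compile. For orientation, a sketch (an assumption, not fbgemm_gpu's code) of what a rowwise fp8 quantization like triton_quantize_fp8_row plausibly computes, emulated in plain PyTorch so it runs without Triton fp8e4nv support; the 448.0 constant is the e4m3 finite max, and the exact clamping may differ:

    import torch

    def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None):
        row_max = y.abs().amax(dim=1)                  # per-row absolute max
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # honor the upper bound
        scale = row_max.clamp(min=1e-12) / 448.0        # e4m3 max finite value
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale  # dequantize via y_fp8.to(torch.float32) * scale[:, None]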
2025-05-07T20:32:14.2080364Z op = torch.compile(op) 2025-05-07T20:32:14.2080481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2080558Z 2025-05-07T20:32:14.2080653Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2080657Z 2025-05-07T20:32:14.2080772Z moe/activation_test.py:117: 2025-05-07T20:32:14.2080904Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2081064Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2081168Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2081549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.2081656Z return fn(*args, **kwargs) 2025-05-07T20:32:14.2082162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2082267Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2082643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2082873Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2083238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2083338Z kernel = self.compile( 2025-05-07T20:32:14.2083735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2083923Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2084055Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2084063Z 2025-05-07T20:32:14.2084283Z self = 2025-05-07T20:32:14.2085085Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2085684Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaebbf52d0>} 2025-05-07T20:32:14.2086460Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2086662Z context = 2025-05-07T20:32:14.2086667Z 2025-05-07T20:32:14.2086847Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2087125Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2087238Z module_map=module_map) 2025-05-07T20:32:14.2087419Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2087524Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2087611Z E ^ 2025-05-07T20:32:14.2087980Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2087985Z 2025-05-07T20:32:14.2088411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2088418Z 2025-05-07T20:32:14.2088536Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2088767Z self=, 2025-05-07T20:32:14.2088847Z T=1, 2025-05-07T20:32:14.2088981Z D=5120, 2025-05-07T20:32:14.2089071Z scale_ub=1200.0, 2025-05-07T20:32:14.2089169Z contiguous=False, 2025-05-07T20:32:14.2089256Z compiled=False, 2025-05-07T20:32:14.2089333Z ) 2025-05-07T20:32:14.2089565Z self = 2025-05-07T20:32:14.2089741Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:14.2089745Z 2025-05-07T20:32:14.2089827Z @given( 2025-05-07T20:32:14.2089964Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2090069Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2090235Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2090363Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2090482Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2090568Z ) 2025-05-07T20:32:14.2090819Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2090919Z def test_silu_mul_quant( 2025-05-07T20:32:14.2091006Z self, 2025-05-07T20:32:14.2091084Z T: int, 2025-05-07T20:32:14.2091163Z D: int, 2025-05-07T20:32:14.2091269Z scale_ub: Optional[float], 2025-05-07T20:32:14.2091360Z contiguous: bool, 2025-05-07T20:32:14.2091447Z compiled: bool, 2025-05-07T20:32:14.2091531Z ) -> None: 2025-05-07T20:32:14.2091629Z torch.manual_seed(2025) 2025-05-07T20:32:14.2091703Z 2025-05-07T20:32:14.2091885Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2091959Z 2025-05-07T20:32:14.2092066Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2092196Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2092287Z x = x_sign * x_clamp 2025-05-07T20:32:14.2092374Z x0 = x[:, :D] 2025-05-07T20:32:14.2092454Z x1 = x[:, D:] 2025-05-07T20:32:14.2092529Z 2025-05-07T20:32:14.2092619Z if contiguous: 2025-05-07T20:32:14.2092715Z x0 = x0.contiguous() 2025-05-07T20:32:14.2092805Z x1 = x1.contiguous() 2025-05-07T20:32:14.2092886Z 2025-05-07T20:32:14.2092982Z if scale_ub is not None: 2025-05-07T20:32:14.2093089Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2093236Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2093313Z ) 2025-05-07T20:32:14.2093401Z else: 2025-05-07T20:32:14.2093587Z scale_ub_tensor = None 2025-05-07T20:32:14.2093663Z 2025-05-07T20:32:14.2093803Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2093898Z op = silu_mul_quant 2025-05-07T20:32:14.2093985Z if compiled: 2025-05-07T20:32:14.2094093Z op = torch.compile(op) 2025-05-07T20:32:14.2094201Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2094275Z 2025-05-07T20:32:14.2094375Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2094382Z 2025-05-07T20:32:14.2094483Z moe/activation_test.py:117: 2025-05-07T20:32:14.2094614Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2094724Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2094826Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2095345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2095451Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2095819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2096055Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2096406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2096507Z kernel = self.compile( 2025-05-07T20:32:14.2096899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2097127Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2097260Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2097265Z 2025-05-07T20:32:14.2097478Z self = 2025-05-07T20:32:14.2098474Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2099148Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaebbf5b40>} 2025-05-07T20:32:14.2099918Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2100176Z context = 2025-05-07T20:32:14.2100183Z 2025-05-07T20:32:14.2100415Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2100747Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2100859Z module_map=module_map) 2025-05-07T20:32:14.2101025Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2101157Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2101237Z E ^ 2025-05-07T20:32:14.2101604Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2101608Z 2025-05-07T20:32:14.2102048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2102055Z 2025-05-07T20:32:14.2107925Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2108250Z self=, 2025-05-07T20:32:14.2108369Z T=16384, 2025-05-07T20:32:14.2108479Z D=5120, 2025-05-07T20:32:14.2108603Z scale_ub=1200.0, 2025-05-07T20:32:14.2108692Z contiguous=False, 2025-05-07T20:32:14.2108923Z compiled=True, 2025-05-07T20:32:14.2109003Z ) 2025-05-07T20:32:14.2109230Z self = 2025-05-07T20:32:14.2109423Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.2109429Z 2025-05-07T20:32:14.2109508Z @given( 2025-05-07T20:32:14.2109631Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2109740Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2109909Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2110068Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2110196Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2110273Z ) 2025-05-07T20:32:14.2110537Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2110635Z def test_silu_mul_quant( 2025-05-07T20:32:14.2110714Z self, 2025-05-07T20:32:14.2110804Z T: int, 2025-05-07T20:32:14.2110888Z D: int, 2025-05-07T20:32:14.2111013Z scale_ub: Optional[float], 2025-05-07T20:32:14.2111155Z contiguous: bool, 2025-05-07T20:32:14.2111284Z compiled: bool, 2025-05-07T20:32:14.2111396Z ) -> None: 2025-05-07T20:32:14.2111540Z torch.manual_seed(2025) 2025-05-07T20:32:14.2111647Z 2025-05-07T20:32:14.2111879Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2111998Z 2025-05-07T20:32:14.2112126Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2112414Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2112539Z x = x_sign * x_clamp 2025-05-07T20:32:14.2112624Z x0 = x[:, :D] 2025-05-07T20:32:14.2112717Z x1 = x[:, D:] 2025-05-07T20:32:14.2112792Z 2025-05-07T20:32:14.2112881Z if contiguous: 2025-05-07T20:32:14.2112989Z x0 = x0.contiguous() 2025-05-07T20:32:14.2113083Z x1 = x1.contiguous() 2025-05-07T20:32:14.2113163Z 2025-05-07T20:32:14.2113264Z if scale_ub is not None: 2025-05-07T20:32:14.2113374Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2113574Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2113660Z ) 2025-05-07T20:32:14.2113738Z else: 2025-05-07T20:32:14.2113841Z scale_ub_tensor = None 2025-05-07T20:32:14.2113915Z 2025-05-07T20:32:14.2114050Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2114152Z op = silu_mul_quant 2025-05-07T20:32:14.2114240Z if compiled: 2025-05-07T20:32:14.2114342Z op = torch.compile(op) 2025-05-07T20:32:14.2114458Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2114534Z 2025-05-07T20:32:14.2114627Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2114631Z 2025-05-07T20:32:14.2114739Z moe/activation_test.py:117: 2025-05-07T20:32:14.2114874Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2114986Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2115089Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2115478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.2115582Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.2116091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2116193Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2116567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2116797Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2117159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2117377Z kernel = self.compile( 2025-05-07T20:32:14.2117772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2117969Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2118100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2118104Z 2025-05-07T20:32:14.2118316Z self = 2025-05-07T20:32:14.2119130Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2119652Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaebbf6cb0>} 2025-05-07T20:32:14.2120433Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2120634Z context = 2025-05-07T20:32:14.2120639Z 2025-05-07T20:32:14.2120814Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2121090Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2121247Z module_map=module_map) 2025-05-07T20:32:14.2121422Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2121527Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2121607Z E ^ 2025-05-07T20:32:14.2121981Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2121986Z 2025-05-07T20:32:14.2122420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2122424Z 2025-05-07T20:32:14.2122637Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2122867Z self=, 2025-05-07T20:32:14.2122945Z T=2048, 2025-05-07T20:32:14.2123032Z D=7168, 2025-05-07T20:32:14.2123118Z scale_ub=1200.0, 2025-05-07T20:32:14.2123207Z contiguous=False, 2025-05-07T20:32:14.2123304Z compiled=True, 2025-05-07T20:32:14.2123380Z ) 2025-05-07T20:32:14.2123612Z self = 2025-05-07T20:32:14.2123791Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.2123795Z 2025-05-07T20:32:14.2123874Z @given( 2025-05-07T20:32:14.2124004Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2124111Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2124229Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2124358Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2124477Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2124558Z ) 2025-05-07T20:32:14.2124810Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2124905Z def test_silu_mul_quant( 2025-05-07T20:32:14.2124993Z self, 2025-05-07T20:32:14.2125073Z T: int, 2025-05-07T20:32:14.2125152Z D: int, 2025-05-07T20:32:14.2125260Z scale_ub: Optional[float], 2025-05-07T20:32:14.2125350Z contiguous: bool, 2025-05-07T20:32:14.2125436Z compiled: bool, 2025-05-07T20:32:14.2125524Z ) -> None: 2025-05-07T20:32:14.2125621Z torch.manual_seed(2025) 2025-05-07T20:32:14.2125694Z 2025-05-07T20:32:14.2125867Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2126035Z 2025-05-07T20:32:14.2126129Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2126257Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2126355Z x = x_sign * x_clamp 2025-05-07T20:32:14.2126439Z x0 = x[:, :D] 2025-05-07T20:32:14.2126524Z x1 = x[:, D:] 2025-05-07T20:32:14.2126596Z 2025-05-07T20:32:14.2126680Z if contiguous: 2025-05-07T20:32:14.2126780Z x0 = x0.contiguous() 2025-05-07T20:32:14.2126868Z x1 = x1.contiguous() 2025-05-07T20:32:14.2126946Z 2025-05-07T20:32:14.2127043Z if scale_ub is not None: 2025-05-07T20:32:14.2127149Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2127286Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2127366Z ) 2025-05-07T20:32:14.2127442Z else: 2025-05-07T20:32:14.2127536Z scale_ub_tensor = None 2025-05-07T20:32:14.2127614Z 2025-05-07T20:32:14.2127752Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2127848Z op = silu_mul_quant 2025-05-07T20:32:14.2127933Z if compiled: 2025-05-07T20:32:14.2128037Z op = torch.compile(op) 2025-05-07T20:32:14.2128148Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2128223Z 2025-05-07T20:32:14.2128319Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2128324Z 2025-05-07T20:32:14.2128429Z moe/activation_test.py:117: 2025-05-07T20:32:14.2128559Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2128709Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2128816Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2129194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.2129294Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.2129805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2129905Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2130277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2130549Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2130897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2131001Z kernel = self.compile( 2025-05-07T20:32:14.2131392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2131578Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2131704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2131709Z 2025-05-07T20:32:14.2131925Z self = 2025-05-07T20:32:14.2132778Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2133306Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaebbf7b50>} 2025-05-07T20:32:14.2134082Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2134279Z context = 2025-05-07T20:32:14.2134283Z 2025-05-07T20:32:14.2134459Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2134812Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2134922Z module_map=module_map) 2025-05-07T20:32:14.2135096Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2135197Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2135273Z E ^ 2025-05-07T20:32:14.2135644Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2135652Z 2025-05-07T20:32:14.2136076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2136081Z 2025-05-07T20:32:14.2136193Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2136423Z self=, 2025-05-07T20:32:14.2136500Z T=1, 2025-05-07T20:32:14.2136585Z D=5120, 2025-05-07T20:32:14.2136668Z scale_ub=None, 2025-05-07T20:32:14.2136762Z contiguous=False, 2025-05-07T20:32:14.2136858Z compiled=False, 2025-05-07T20:32:14.2136934Z ) 2025-05-07T20:32:14.2137157Z self = 2025-05-07T20:32:14.2137340Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:14.2137345Z 2025-05-07T20:32:14.2137424Z @given( 2025-05-07T20:32:14.2137553Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2137655Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2137820Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2137945Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2138236Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2138339Z ) 2025-05-07T20:32:14.2138599Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2138698Z def test_silu_mul_quant( 2025-05-07T20:32:14.2138788Z self, 2025-05-07T20:32:14.2138867Z T: int, 2025-05-07T20:32:14.2138945Z D: int, 2025-05-07T20:32:14.2139050Z scale_ub: Optional[float], 2025-05-07T20:32:14.2139192Z contiguous: bool, 2025-05-07T20:32:14.2139280Z compiled: bool, 2025-05-07T20:32:14.2139364Z ) -> None: 2025-05-07T20:32:14.2139461Z torch.manual_seed(2025) 2025-05-07T20:32:14.2139535Z 2025-05-07T20:32:14.2139713Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2139790Z 2025-05-07T20:32:14.2139884Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2140017Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2140108Z x = x_sign * x_clamp 2025-05-07T20:32:14.2140189Z x0 = x[:, :D] 2025-05-07T20:32:14.2140276Z x1 = x[:, D:] 2025-05-07T20:32:14.2140349Z 2025-05-07T20:32:14.2140440Z if contiguous: 2025-05-07T20:32:14.2140533Z x0 = x0.contiguous() 2025-05-07T20:32:14.2140628Z x1 = x1.contiguous() 2025-05-07T20:32:14.2140709Z 2025-05-07T20:32:14.2140802Z if scale_ub is not None: 2025-05-07T20:32:14.2140913Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2141056Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2141134Z ) 2025-05-07T20:32:14.2141212Z else: 2025-05-07T20:32:14.2141314Z scale_ub_tensor = None 2025-05-07T20:32:14.2141391Z 2025-05-07T20:32:14.2141524Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2141626Z op = silu_mul_quant 2025-05-07T20:32:14.2141712Z if compiled: 2025-05-07T20:32:14.2141819Z op = torch.compile(op) 2025-05-07T20:32:14.2141931Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2142004Z 2025-05-07T20:32:14.2142103Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2142107Z 2025-05-07T20:32:14.2142207Z moe/activation_test.py:117: 2025-05-07T20:32:14.2142433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2142546Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2142652Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2143166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2143277Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2143646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2143885Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2144236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2144333Z kernel = self.compile( 2025-05-07T20:32:14.2144742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2144921Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2145054Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2145060Z 2025-05-07T20:32:14.2145272Z self = 2025-05-07T20:32:14.2146073Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2146643Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaebfe85e0>} 2025-05-07T20:32:14.2147414Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2147619Z context = 2025-05-07T20:32:14.2147699Z 2025-05-07T20:32:14.2147872Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2148148Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2148265Z module_map=module_map) 2025-05-07T20:32:14.2148433Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2148544Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2148624Z E ^ 2025-05-07T20:32:14.2148987Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2148991Z 2025-05-07T20:32:14.2149423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2149432Z 2025-05-07T20:32:14.2149540Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2149776Z self=, 2025-05-07T20:32:14.2149859Z T=4096, 2025-05-07T20:32:14.2149936Z D=7168, 2025-05-07T20:32:14.2150027Z scale_ub=1200.0, 2025-05-07T20:32:14.2150120Z contiguous=False, 2025-05-07T20:32:14.2150205Z compiled=False, 2025-05-07T20:32:14.2150287Z ) 2025-05-07T20:32:14.2150511Z self = 2025-05-07T20:32:14.2150695Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:14.2150700Z 2025-05-07T20:32:14.2150784Z @given( 2025-05-07T20:32:14.2150910Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2151018Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2151138Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2151348Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2151473Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2151549Z ) 2025-05-07T20:32:14.2151804Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2151905Z def test_silu_mul_quant( 2025-05-07T20:32:14.2151984Z self, 2025-05-07T20:32:14.2152062Z T: int, 2025-05-07T20:32:14.2152147Z D: int, 2025-05-07T20:32:14.2152254Z scale_ub: Optional[float], 2025-05-07T20:32:14.2152367Z contiguous: bool, 2025-05-07T20:32:14.2152466Z compiled: bool, 2025-05-07T20:32:14.2152564Z ) -> None: 2025-05-07T20:32:14.2152674Z torch.manual_seed(2025) 2025-05-07T20:32:14.2152747Z 2025-05-07T20:32:14.2152921Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2153003Z 2025-05-07T20:32:14.2153097Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2153230Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2153326Z x = x_sign * x_clamp 2025-05-07T20:32:14.2153407Z x0 = x[:, :D] 2025-05-07T20:32:14.2153490Z x1 = x[:, D:] 2025-05-07T20:32:14.2153572Z 2025-05-07T20:32:14.2153656Z if contiguous: 2025-05-07T20:32:14.2153749Z x0 = x0.contiguous() 2025-05-07T20:32:14.2153845Z x1 = x1.contiguous() 2025-05-07T20:32:14.2153918Z 2025-05-07T20:32:14.2154010Z if scale_ub is not None: 2025-05-07T20:32:14.2154123Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2154309Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2154394Z ) 2025-05-07T20:32:14.2154471Z else: 2025-05-07T20:32:14.2154567Z scale_ub_tensor = None 2025-05-07T20:32:14.2154646Z 2025-05-07T20:32:14.2154778Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2154868Z op = silu_mul_quant 2025-05-07T20:32:14.2154969Z if compiled: 2025-05-07T20:32:14.2155071Z op = torch.compile(op) 2025-05-07T20:32:14.2155179Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2155302Z 2025-05-07T20:32:14.2155395Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2155400Z 2025-05-07T20:32:14.2155506Z moe/activation_test.py:117: 2025-05-07T20:32:14.2156017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2156128Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2156242Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2156759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:14.2156859Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2157234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2157467Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2157824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2157923Z kernel = self.compile( 2025-05-07T20:32:14.2158314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2158498Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2158623Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2158631Z 2025-05-07T20:32:14.2158841Z self = 2025-05-07T20:32:14.2159646Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2160401Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaebfe8ca0>} 2025-05-07T20:32:14.2161184Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2161382Z context = 2025-05-07T20:32:14.2161388Z 2025-05-07T20:32:14.2161562Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2161834Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2161942Z module_map=module_map) 2025-05-07T20:32:14.2162117Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2162222Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2162306Z E ^ 2025-05-07T20:32:14.2162676Z E ValueError("type fp8e4nv not supported in this architecture. 
Hypothesis, running at verbosity=Verbosity.verbose, keeps drawing new parameter combinations, and every draw replays the identical traceback: moe/activation_test.py:117 into fn, the _fbgemm_silu_mul_quant[grid] launch at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80, then triton/runtime/jit.py and triton/compiler/compiler.py, ending in the same CompilationError.

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True): CompilationError, type fp8e4nv not supported in this architecture
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True): CompilationError, type fp8e4nv not supported in this architecture
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False): CompilationError, type fp8e4nv not supported in this architecture
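The test body shown above is straightforward: it seeds the RNG, builds a [T, 2 * D] bfloat16 tensor whose magnitudes are clamped to [0.01, 2.0], and splits it into halves x0 and x1 before calling silu_mul_quant. The math the fused kernel is being asked to produce, as the name suggests, is a SiLU gate on the first half multiplied elementwise by the second half. A plain-PyTorch sketch of that unquantized part (the semantics and the float32 upcast are assumptions about the kernel's internals, inferred from the test, not taken from fbgemm source):

    import torch
    import torch.nn.functional as F

    def silu_mul_reference(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # silu(x0) * x1, upcast from bfloat16 for a higher-precision reference.
        return F.silu(x0.float()) * x1.float()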
The next draws fail the same way, with the compiled flag making no difference to the outcome:

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): CompilationError, type fp8e4nv not supported in this architecture
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): CompilationError, type fp8e4nv not supported in this architecture
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False): CompilationError, type fp8e4nv not supported in this architecture
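Note how scale_ub is handled in the test: when drawn as 1200.0 it is wrapped in a one-element float32 CUDA tensor, and when drawn as None no bound is passed at all. An upper bound like this typically caps the per-row amax before the FP8 scale is derived, so a single outlier row cannot stretch the quantization range. A sketch of that scheme for row-wise E4M3 quantization, returning the (y_fp8, y_scale) pair the test unpacks (the exact scaling recipe is an assumption, not the kernel's verified behavior):

    from typing import Optional

    import torch

    FP8_E4M3_MAX = 448.0  # finite max of torch.float8_e4m3fn

    def quantize_fp8_rowwise(y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None):
        # Per-row absolute max, optionally clamped by the upper bound.
        amax = y.abs().amax(dim=-1, keepdim=True).float()
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub)
        y_scale = torch.clamp(amax, min=1e-12) / FP8_E4M3_MAX
        y_fp8 = (y.float() / y_scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
        return y_fp8, y_scale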
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2250114Z 2025-05-07T20:32:14.2250539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2250544Z 2025-05-07T20:32:14.2250765Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2251009Z self=, 2025-05-07T20:32:14.2251095Z T=16384, 2025-05-07T20:32:14.2251182Z D=7168, 2025-05-07T20:32:14.2251271Z scale_ub=1200.0, 2025-05-07T20:32:14.2251363Z contiguous=False, 2025-05-07T20:32:14.2251460Z compiled=True, 2025-05-07T20:32:14.2251538Z ) 2025-05-07T20:32:14.2251768Z self = 2025-05-07T20:32:14.2251963Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.2251967Z 2025-05-07T20:32:14.2252049Z @given( 2025-05-07T20:32:14.2252177Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2252298Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2252429Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2252576Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2252728Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2252808Z ) 2025-05-07T20:32:14.2253073Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2253174Z def test_silu_mul_quant( 2025-05-07T20:32:14.2253257Z self, 2025-05-07T20:32:14.2253345Z T: int, 2025-05-07T20:32:14.2253425Z D: int, 2025-05-07T20:32:14.2253529Z scale_ub: Optional[float], 2025-05-07T20:32:14.2253629Z contiguous: bool, 2025-05-07T20:32:14.2253765Z compiled: bool, 2025-05-07T20:32:14.2253847Z ) -> None: 2025-05-07T20:32:14.2253955Z torch.manual_seed(2025) 2025-05-07T20:32:14.2254030Z 2025-05-07T20:32:14.2254215Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2254291Z 2025-05-07T20:32:14.2254387Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2254524Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2254623Z x = x_sign * x_clamp 2025-05-07T20:32:14.2254707Z x0 = x[:, :D] 2025-05-07T20:32:14.2254798Z x1 = x[:, D:] 2025-05-07T20:32:14.2254916Z 2025-05-07T20:32:14.2255004Z if contiguous: 2025-05-07T20:32:14.2255108Z x0 = x0.contiguous() 2025-05-07T20:32:14.2255202Z x1 = x1.contiguous() 2025-05-07T20:32:14.2255276Z 2025-05-07T20:32:14.2255379Z if scale_ub is not None: 2025-05-07T20:32:14.2255489Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2256025Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2256132Z ) 2025-05-07T20:32:14.2256215Z else: 2025-05-07T20:32:14.2256314Z scale_ub_tensor = None 2025-05-07T20:32:14.2256397Z 2025-05-07T20:32:14.2256532Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2256627Z op = silu_mul_quant 2025-05-07T20:32:14.2256718Z if compiled: 2025-05-07T20:32:14.2256828Z op = torch.compile(op) 2025-05-07T20:32:14.2256939Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2257022Z 2025-05-07T20:32:14.2257119Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2257124Z 2025-05-07T20:32:14.2257226Z moe/activation_test.py:117: 2025-05-07T20:32:14.2257366Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2257471Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2257580Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2257963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.2258161Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.2258677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2258779Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2259379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2259620Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2259977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2260082Z kernel = self.compile( 2025-05-07T20:32:14.2260478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2260662Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2260802Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2260806Z 2025-05-07T20:32:14.2261022Z self = 2025-05-07T20:32:14.2261837Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2262360Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb9c1bd0>} 2025-05-07T20:32:14.2263190Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2263453Z context = 2025-05-07T20:32:14.2263458Z 2025-05-07T20:32:14.2263629Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2263910Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2264021Z module_map=module_map) 2025-05-07T20:32:14.2264197Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2264310Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2264391Z E ^ 2025-05-07T20:32:14.2264834Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2264839Z 2025-05-07T20:32:14.2265266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2265274Z 2025-05-07T20:32:14.2265386Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2265624Z self=, 2025-05-07T20:32:14.2265706Z T=1, 2025-05-07T20:32:14.2265785Z D=7168, 2025-05-07T20:32:14.2265878Z scale_ub=None, 2025-05-07T20:32:14.2265969Z contiguous=False, 2025-05-07T20:32:14.2266067Z compiled=False, 2025-05-07T20:32:14.2266142Z ) 2025-05-07T20:32:14.2266376Z self = 2025-05-07T20:32:14.2266562Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:14.2266570Z 2025-05-07T20:32:14.2266649Z @given( 2025-05-07T20:32:14.2266773Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2266884Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2267004Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2267125Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2267257Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2267333Z ) 2025-05-07T20:32:14.2267595Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2267692Z def test_silu_mul_quant( 2025-05-07T20:32:14.2267771Z self, 2025-05-07T20:32:14.2267860Z T: int, 2025-05-07T20:32:14.2267939Z D: int, 2025-05-07T20:32:14.2268042Z scale_ub: Optional[float], 2025-05-07T20:32:14.2268226Z contiguous: bool, 2025-05-07T20:32:14.2268318Z compiled: bool, 2025-05-07T20:32:14.2268401Z ) -> None: 2025-05-07T20:32:14.2268509Z torch.manual_seed(2025) 2025-05-07T20:32:14.2268583Z 2025-05-07T20:32:14.2268759Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2268842Z 2025-05-07T20:32:14.2268938Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2269074Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2269168Z x = x_sign * x_clamp 2025-05-07T20:32:14.2269250Z x0 = x[:, :D] 2025-05-07T20:32:14.2269338Z x1 = x[:, D:] 2025-05-07T20:32:14.2269413Z 2025-05-07T20:32:14.2269498Z if contiguous: 2025-05-07T20:32:14.2269602Z x0 = x0.contiguous() 2025-05-07T20:32:14.2269694Z x1 = x1.contiguous() 2025-05-07T20:32:14.2269769Z 2025-05-07T20:32:14.2269869Z if scale_ub is not None: 2025-05-07T20:32:14.2269986Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2270126Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2270210Z ) 2025-05-07T20:32:14.2270290Z else: 2025-05-07T20:32:14.2270393Z scale_ub_tensor = None 2025-05-07T20:32:14.2270469Z 2025-05-07T20:32:14.2270604Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2270702Z op = silu_mul_quant 2025-05-07T20:32:14.2270790Z if compiled: 2025-05-07T20:32:14.2270941Z op = torch.compile(op) 2025-05-07T20:32:14.2271056Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2271132Z 2025-05-07T20:32:14.2271227Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2271231Z 2025-05-07T20:32:14.2271336Z moe/activation_test.py:117: 2025-05-07T20:32:14.2271468Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2271579Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2271690Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2272204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2272356Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2272729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2272958Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2273321Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2273418Z kernel = self.compile( 2025-05-07T20:32:14.2273818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2274000Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2274133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2274138Z 2025-05-07T20:32:14.2274357Z self = 2025-05-07T20:32:14.2275162Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2275689Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb9c2050>} 2025-05-07T20:32:14.2276459Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2276658Z context = 2025-05-07T20:32:14.2276740Z 2025-05-07T20:32:14.2276920Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2277195Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2277315Z module_map=module_map) 2025-05-07T20:32:14.2277485Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2277593Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2277677Z E ^ 2025-05-07T20:32:14.2278046Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2278050Z 2025-05-07T20:32:14.2278485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2278490Z 2025-05-07T20:32:14.2278599Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2278837Z self=, 2025-05-07T20:32:14.2278923Z T=2048, 2025-05-07T20:32:14.2279002Z D=7168, 2025-05-07T20:32:14.2279088Z scale_ub=None, 2025-05-07T20:32:14.2279187Z contiguous=False, 2025-05-07T20:32:14.2279274Z compiled=True, 2025-05-07T20:32:14.2279350Z ) 2025-05-07T20:32:14.2279582Z self = 2025-05-07T20:32:14.2279762Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:14.2279766Z 2025-05-07T20:32:14.2279901Z @given( 2025-05-07T20:32:14.2280023Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2280127Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2280254Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2280376Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2280494Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2280581Z ) 2025-05-07T20:32:14.2280840Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2280938Z def test_silu_mul_quant( 2025-05-07T20:32:14.2281106Z self, 2025-05-07T20:32:14.2281186Z T: int, 2025-05-07T20:32:14.2281274Z D: int, 2025-05-07T20:32:14.2281377Z scale_ub: Optional[float], 2025-05-07T20:32:14.2281471Z contiguous: bool, 2025-05-07T20:32:14.2281565Z compiled: bool, 2025-05-07T20:32:14.2281647Z ) -> None: 2025-05-07T20:32:14.2281746Z torch.manual_seed(2025) 2025-05-07T20:32:14.2281832Z 2025-05-07T20:32:14.2282009Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2282086Z 2025-05-07T20:32:14.2282191Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2282325Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2282430Z x = x_sign * x_clamp 2025-05-07T20:32:14.2282531Z x0 = x[:, :D] 2025-05-07T20:32:14.2282630Z x1 = x[:, D:] 2025-05-07T20:32:14.2282717Z 2025-05-07T20:32:14.2282812Z if contiguous: 2025-05-07T20:32:14.2282907Z x0 = x0.contiguous() 2025-05-07T20:32:14.2283007Z x1 = x1.contiguous() 2025-05-07T20:32:14.2283082Z 2025-05-07T20:32:14.2283179Z if scale_ub is not None: 2025-05-07T20:32:14.2283294Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2283434Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2283516Z ) 2025-05-07T20:32:14.2283604Z else: 2025-05-07T20:32:14.2283701Z scale_ub_tensor = None 2025-05-07T20:32:14.2283776Z 2025-05-07T20:32:14.2283917Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2284009Z op = silu_mul_quant 2025-05-07T20:32:14.2284097Z if compiled: 2025-05-07T20:32:14.2284205Z op = torch.compile(op) 2025-05-07T20:32:14.2284314Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2284488Z 2025-05-07T20:32:14.2284584Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2284589Z 2025-05-07T20:32:14.2284690Z moe/activation_test.py:117: 2025-05-07T20:32:14.2284831Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2284934Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2285037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2285424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.2285523Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.2286040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2286141Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2286508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2286754Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2287104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2287204Z kernel = self.compile( 2025-05-07T20:32:14.2287603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2287784Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2287919Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2287967Z 2025-05-07T20:32:14.2288181Z self = 2025-05-07T20:32:14.2288986Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2289523Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb9c31c0>} 2025-05-07T20:32:14.2290333Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2290538Z context = 2025-05-07T20:32:14.2290545Z 2025-05-07T20:32:14.2290717Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2290995Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2291105Z module_map=module_map) 2025-05-07T20:32:14.2291270Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2291379Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2291460Z E ^ 2025-05-07T20:32:14.2291825Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2291833Z 2025-05-07T20:32:14.2292264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2292268Z 2025-05-07T20:32:14.2292374Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2292618Z self=, 2025-05-07T20:32:14.2292719Z T=4096, 2025-05-07T20:32:14.2292803Z D=7168, 2025-05-07T20:32:14.2292913Z scale_ub=None, 2025-05-07T20:32:14.2293001Z contiguous=False, 2025-05-07T20:32:14.2293088Z compiled=True, 2025-05-07T20:32:14.2293167Z ) 2025-05-07T20:32:14.2293393Z self = 2025-05-07T20:32:14.2293651Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:14.2293664Z 2025-05-07T20:32:14.2293744Z @given( 2025-05-07T20:32:14.2293866Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2293978Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2294096Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2294216Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2294341Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2294419Z ) 2025-05-07T20:32:14.2294674Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2294780Z def test_silu_mul_quant( 2025-05-07T20:32:14.2294858Z self, 2025-05-07T20:32:14.2294937Z T: int, 2025-05-07T20:32:14.2295020Z D: int, 2025-05-07T20:32:14.2295121Z scale_ub: Optional[float], 2025-05-07T20:32:14.2295219Z contiguous: bool, 2025-05-07T20:32:14.2295307Z compiled: bool, 2025-05-07T20:32:14.2295390Z ) -> None: 2025-05-07T20:32:14.2295494Z torch.manual_seed(2025) 2025-05-07T20:32:14.2295568Z 2025-05-07T20:32:14.2295742Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2295826Z 2025-05-07T20:32:14.2295922Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2296050Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2296146Z x = x_sign * x_clamp 2025-05-07T20:32:14.2296227Z x0 = x[:, :D] 2025-05-07T20:32:14.2296309Z x1 = x[:, D:] 2025-05-07T20:32:14.2296435Z 2025-05-07T20:32:14.2296520Z if contiguous: 2025-05-07T20:32:14.2296613Z x0 = x0.contiguous() 2025-05-07T20:32:14.2296708Z x1 = x1.contiguous() 2025-05-07T20:32:14.2296781Z 2025-05-07T20:32:14.2296879Z if scale_ub is not None: 2025-05-07T20:32:14.2296986Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2297125Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2297213Z ) 2025-05-07T20:32:14.2297290Z else: 2025-05-07T20:32:14.2297386Z scale_ub_tensor = None 2025-05-07T20:32:14.2297510Z 2025-05-07T20:32:14.2297642Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2297735Z op = silu_mul_quant 2025-05-07T20:32:14.2297826Z if compiled: 2025-05-07T20:32:14.2297928Z op = torch.compile(op) 2025-05-07T20:32:14.2298177Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2298287Z 2025-05-07T20:32:14.2298389Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2298394Z 2025-05-07T20:32:14.2298501Z moe/activation_test.py:117: 2025-05-07T20:32:14.2298630Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2298735Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2298843Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2299231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.2299326Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.2299861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2299965Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2300342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2300575Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2300939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2301036Z kernel = self.compile( 2025-05-07T20:32:14.2301432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2301709Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2301841Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2301848Z 2025-05-07T20:32:14.2302072Z self = 2025-05-07T20:32:14.2302882Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2303409Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb6bc1f0>} 2025-05-07T20:32:14.2304193Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2304403Z context = 2025-05-07T20:32:14.2304408Z 2025-05-07T20:32:14.2304590Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2304868Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2304981Z module_map=module_map) 2025-05-07T20:32:14.2305156Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2305259Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2305388Z E ^ 2025-05-07T20:32:14.2305835Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
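[Editorial note: the test drives silu_mul_quant from fbgemm_gpu.experimental.gen_ai.moe over bf16 activations split into gate (x0) and up (x1) halves. As a reading aid only, here is a rough eager-mode sketch of what the op plausibly computes, inferred from the test's inputs and outputs rather than from FBGEMM's actual kernel:]

# Eager-mode sketch (assumption inferred from the test, NOT FBGEMM's
# implementation): y = silu(x0) * x1, quantized rowwise to FP8 with an
# optional upper bound applied to the per-row amax.
from typing import Optional, Tuple
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max / FP8_MAX
    y_fp8 = (y / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(1)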
[Hypothesis retried eight further examples; each reran the identical test body above and failed at the same make_ir step with the same CompilationError from compiler.py:100, so the repeated source listings and tracebacks are elided. Runs with compiled=True add one extra torch/_dynamo/eval_frame.py:678 frame; the tracebacks are otherwise identical.]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)   -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)    -> same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)     -> same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)   -> same CompilationError
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)    -> same CompilationError
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)    -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)   -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)  -> same CompilationError
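[Editorial note: every combination fails the same capability check before the kernel body is ever lowered, so a device-capability guard would silence the whole parameter matrix on unsupported runners. A sketch only; the helper and class names are illustrative, not FBGEMM's actual guard:]

# Sketch of a capability guard (illustrative, not FBGEMM's actual code):
# skip FP8 tests on GPUs older than SM 8.9, where Triton's fp8e4nv
# (torch.float8_e4m3fn) type is unavailable.
import unittest
import torch

def _supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    # fp8e4nv needs compute capability >= 8.9 (Ada/Hopper);
    # the A10G on this runner reports (8, 6).
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not _supports_fp8e4nv(), "FP8 (fp8e4nv) requires SM 8.9+")
class ActivationFP8Tests(unittest.TestCase):
    ...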
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2437169Z 2025-05-07T20:32:14.2437598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2437603Z 2025-05-07T20:32:14.2437718Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2437948Z self=, 2025-05-07T20:32:14.2438040Z T=1, 2025-05-07T20:32:14.2438124Z D=7168, 2025-05-07T20:32:14.2438216Z scale_ub=1200.0, 2025-05-07T20:32:14.2438314Z contiguous=False, 2025-05-07T20:32:14.2438402Z compiled=False, 2025-05-07T20:32:14.2438477Z ) 2025-05-07T20:32:14.2438707Z self = 2025-05-07T20:32:14.2438884Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:14.2438889Z 2025-05-07T20:32:14.2438975Z @given( 2025-05-07T20:32:14.2439105Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2439210Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2439335Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2439465Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2439584Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2439666Z ) 2025-05-07T20:32:14.2439917Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2440060Z def test_silu_mul_quant( 2025-05-07T20:32:14.2440145Z self, 2025-05-07T20:32:14.2440224Z T: int, 2025-05-07T20:32:14.2440302Z D: int, 2025-05-07T20:32:14.2440409Z scale_ub: Optional[float], 2025-05-07T20:32:14.2440503Z contiguous: bool, 2025-05-07T20:32:14.2440592Z compiled: bool, 2025-05-07T20:32:14.2440678Z ) -> None: 2025-05-07T20:32:14.2440782Z torch.manual_seed(2025) 2025-05-07T20:32:14.2440856Z 2025-05-07T20:32:14.2441035Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2441157Z 2025-05-07T20:32:14.2441258Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2441386Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2441476Z x = x_sign * x_clamp 2025-05-07T20:32:14.2441563Z x0 = x[:, :D] 2025-05-07T20:32:14.2441644Z x1 = x[:, D:] 2025-05-07T20:32:14.2441719Z 2025-05-07T20:32:14.2441809Z if contiguous: 2025-05-07T20:32:14.2441901Z x0 = x0.contiguous() 2025-05-07T20:32:14.2441992Z x1 = x1.contiguous() 2025-05-07T20:32:14.2442071Z 2025-05-07T20:32:14.2442164Z if scale_ub is not None: 2025-05-07T20:32:14.2442272Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2442420Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2442495Z ) 2025-05-07T20:32:14.2442583Z else: 2025-05-07T20:32:14.2442680Z scale_ub_tensor = None 2025-05-07T20:32:14.2442754Z 2025-05-07T20:32:14.2442894Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2442989Z op = silu_mul_quant 2025-05-07T20:32:14.2443075Z if compiled: 2025-05-07T20:32:14.2443185Z op = torch.compile(op) 2025-05-07T20:32:14.2443294Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2443372Z 2025-05-07T20:32:14.2443479Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2443483Z 2025-05-07T20:32:14.2443583Z moe/activation_test.py:117: 2025-05-07T20:32:14.2443723Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2443826Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2443929Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2444534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2444637Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2445005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2445241Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2445593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2445696Z kernel = self.compile( 2025-05-07T20:32:14.2446090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2446270Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2446404Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2446408Z 2025-05-07T20:32:14.2446628Z self = 2025-05-07T20:32:14.2447433Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2447954Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb2a2680>} 2025-05-07T20:32:14.2448721Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2449006Z context = 2025-05-07T20:32:14.2449011Z 2025-05-07T20:32:14.2449182Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2449468Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2449581Z module_map=module_map) 2025-05-07T20:32:14.2449790Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2449900Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2449980Z E ^ 2025-05-07T20:32:14.2450345Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2450357Z 2025-05-07T20:32:14.2450784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2450789Z 2025-05-07T20:32:14.2450896Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2451134Z self=, 2025-05-07T20:32:14.2451215Z T=4096, 2025-05-07T20:32:14.2451294Z D=7168, 2025-05-07T20:32:14.2451387Z scale_ub=1200.0, 2025-05-07T20:32:14.2451482Z contiguous=False, 2025-05-07T20:32:14.2451569Z compiled=True, 2025-05-07T20:32:14.2451651Z ) 2025-05-07T20:32:14.2451875Z self = 2025-05-07T20:32:14.2452066Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.2452071Z 2025-05-07T20:32:14.2452152Z @given( 2025-05-07T20:32:14.2452275Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2452388Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2452511Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2452633Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2452758Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2452834Z ) 2025-05-07T20:32:14.2453086Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2453189Z def test_silu_mul_quant( 2025-05-07T20:32:14.2453349Z self, 2025-05-07T20:32:14.2453435Z T: int, 2025-05-07T20:32:14.2453512Z D: int, 2025-05-07T20:32:14.2453614Z scale_ub: Optional[float], 2025-05-07T20:32:14.2453715Z contiguous: bool, 2025-05-07T20:32:14.2453804Z compiled: bool, 2025-05-07T20:32:14.2453883Z ) -> None: 2025-05-07T20:32:14.2453988Z torch.manual_seed(2025) 2025-05-07T20:32:14.2454061Z 2025-05-07T20:32:14.2454234Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2454318Z 2025-05-07T20:32:14.2454414Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2454544Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2454643Z x = x_sign * x_clamp 2025-05-07T20:32:14.2454723Z x0 = x[:, :D] 2025-05-07T20:32:14.2454811Z x1 = x[:, D:] 2025-05-07T20:32:14.2454885Z 2025-05-07T20:32:14.2454969Z if contiguous: 2025-05-07T20:32:14.2455068Z x0 = x0.contiguous() 2025-05-07T20:32:14.2455162Z x1 = x1.contiguous() 2025-05-07T20:32:14.2455237Z 2025-05-07T20:32:14.2455337Z if scale_ub is not None: 2025-05-07T20:32:14.2455445Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2455911Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2456019Z ) 2025-05-07T20:32:14.2456097Z else: 2025-05-07T20:32:14.2456193Z scale_ub_tensor = None 2025-05-07T20:32:14.2456272Z 2025-05-07T20:32:14.2456407Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2456671Z op = silu_mul_quant 2025-05-07T20:32:14.2456767Z if compiled: 2025-05-07T20:32:14.2456868Z op = torch.compile(op) 2025-05-07T20:32:14.2456984Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2457056Z 2025-05-07T20:32:14.2457148Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2457153Z 2025-05-07T20:32:14.2457257Z moe/activation_test.py:117: 2025-05-07T20:32:14.2457394Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2457500Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2457697Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2458174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.2458277Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.2458784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2458890Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2459263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2459492Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2459847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2459950Z kernel = self.compile( 2025-05-07T20:32:14.2460341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2460529Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2460656Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2460660Z 2025-05-07T20:32:14.2460875Z self = 2025-05-07T20:32:14.2461680Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2462390Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb2a3b50>} 2025-05-07T20:32:14.2463172Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2463373Z context = 2025-05-07T20:32:14.2463377Z 2025-05-07T20:32:14.2463552Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2463823Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2463937Z module_map=module_map) 2025-05-07T20:32:14.2464107Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2464208Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2464287Z E ^ 2025-05-07T20:32:14.2464662Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:14.2465212Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:14.2477717Z E       triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.2478705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
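For readers without the FBGEMM source at hand, the following is roughly what the op under test computes. It is inferred from the test alone (the name silu_mul_quant, the op(x0, x1, scale_ub_tensor) call, and the (y_fp8, y_scale) return), so treat it as a sketch of the contract under those assumptions, not FBGEMM's actual kernel:

    from typing import Optional, Tuple
    import torch

    FP8_E4M3_MAX = 448.0  # finite max of torch.float8_e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # silu(x0) * x1 in fp32, then rowwise dynamic quantization to FP8.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
        if scale_ub is not None:
            # scale_ub caps the rowwise max, matching the [1200.0] tensor above.
            row_max = torch.minimum(row_max, scale_ub.float())
        scale = row_max / FP8_E4M3_MAX
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale

Under these assumptions, the fp8e4nv requirement in the Triton kernel corresponds to the final cast to torch.float8_e4m3fn.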
2025-05-07T20:32:14.2478825Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:14.2491390Z E       triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.2492388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.2492560Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:14.2496256Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:14.2498275Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:14.2498409Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:14.2498533Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:14.2508042Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:14.2509919Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:14.2510066Z moe/activation_test.py:95: OutOfMemoryError
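The OutOfMemoryError sizes are not arbitrary: each failing line (torch.randn, torch.abs/torch.clamp, torch.sign) materializes one [T, 2*D] bfloat16 tensor, and at 2 bytes per element the reported "Tried to allocate" values follow directly. A quick check against the shapes seen in this log:

    # bf16 is 2 bytes/element; each op above allocates one [T, 2*D] tensor.
    for T, D in [(16384, 5120), (16384, 7168), (4096, 7168), (2048, 7168)]:
        print(f"T={T:>5}, D={D}: {T * 2 * D * 2 / 2**20:7.2f} MiB")
    # -> 320.00, 448.00, 112.00, 56.00 MiB: the exact sizes in the OOM messages.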
2025-05-07T20:32:14.2510179Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:14.2516770Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:14.2518715Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:14.2518925Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:14.2519069Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:14.2523240Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:14.2525243Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:14.2525393Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:14.2525513Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:14.2529264Z >       x_sign = torch.sign(x)
2025-05-07T20:32:14.2531124Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:14.2531271Z moe/activation_test.py:94: OutOfMemoryError
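Note how the "free" figure shrinks across examples (140.44 MiB, then 28.44, then 26.44) while PyTorch's allocated total creeps up from 21.50 toward 21.73 GiB: tensors from earlier Hypothesis examples are still alive when the next one starts. Two plausible mitigations, the first taken from the allocator hint in the messages themselves (the helper name is illustrative, not from the test file):

    import gc
    import os
    import torch

    # 1) Honor the hint in the OOM message. This must be set before the first
    #    CUDA allocation (e.g., in the CI job's environment) to take effect.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    def release_cuda_memory() -> None:
        # 2) Between examples: drop dead references, then hand cached blocks
        #    back so the next example starts from a clean allocator state.
        gc.collect()
        torch.cuda.empty_cache()

Because Hypothesis runs many examples inside a single test call, per-example cleanup like release_cuda_memory() would have to run at the top of the test body rather than in a tearDown().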
2025-05-07T20:32:14.2531384Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:14.2543737Z E       triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.2544778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.2544899Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:14.2557429Z E       triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.2558551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.2558674Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:14.2571026Z E       triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.2572063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.2572199Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:14.2575580Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:14.2577447Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:14.2577587Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:14.2577703Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:14.2590153Z E       triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.2591155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.2591269Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:14.2594879Z >       x_sign = torch.sign(x)
2025-05-07T20:32:14.2596734Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:14.2596914Z moe/activation_test.py:94: OutOfMemoryError
2025-05-07T20:32:14.2597026Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:14.2600336Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:14.2602243Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:14.2602402Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:14.2602532Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:14.2605959Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:14.2607798Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:14.2607937Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:14.2608051Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:14.2611367Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:14.2613314Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:14.2613485Z moe/activation_test.py:92: OutOfMemoryError
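By this point even the smallest [T, 2*D] allocations (40 MiB) fail, so the ~21.7 GiB baseline almost certainly comes from earlier work in the same process rather than this test's own tensors. A hypothetical instrumentation hook, not present in the test file, that would confirm this between examples:

    import torch

    def log_allocator_state(tag: str) -> None:
        # Compare against the OOM messages: 'allocated' should mirror the
        # 'allocated by PyTorch' figure, 'reserved' the cached total.
        alloc = torch.cuda.memory_allocated() / 2**30
        reserved = torch.cuda.memory_reserved() / 2**30
        print(f"[{tag}] allocated={alloc:.2f} GiB, reserved={reserved:.2f} GiB")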
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.2613360Z 2025-05-07T20:32:14.2613485Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:14.2613489Z 2025-05-07T20:32:14.2613605Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2613835Z self=, 2025-05-07T20:32:14.2613917Z T=4096, 2025-05-07T20:32:14.2614003Z D=7168, 2025-05-07T20:32:14.2614093Z scale_ub=None, 2025-05-07T20:32:14.2614184Z contiguous=True, 2025-05-07T20:32:14.2614279Z compiled=True, 2025-05-07T20:32:14.2614356Z ) 2025-05-07T20:32:14.2614624Z self = 2025-05-07T20:32:14.2614811Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.2614815Z 2025-05-07T20:32:14.2614896Z @given( 2025-05-07T20:32:14.2615024Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2615131Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2615250Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2615377Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2615500Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2615577Z ) 2025-05-07T20:32:14.2615836Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2615935Z def test_silu_mul_quant( 2025-05-07T20:32:14.2616024Z self, 2025-05-07T20:32:14.2616104Z T: int, 2025-05-07T20:32:14.2616187Z D: int, 2025-05-07T20:32:14.2616293Z scale_ub: Optional[float], 2025-05-07T20:32:14.2616390Z contiguous: bool, 2025-05-07T20:32:14.2616480Z compiled: bool, 2025-05-07T20:32:14.2616568Z ) -> None: 2025-05-07T20:32:14.2616667Z torch.manual_seed(2025) 2025-05-07T20:32:14.2616744Z 2025-05-07T20:32:14.2616923Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2618982Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.2618989Z 2025-05-07T20:32:14.2619126Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:14.2619131Z 2025-05-07T20:32:14.2619241Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2619479Z self=, 2025-05-07T20:32:14.2619560Z T=2048, 2025-05-07T20:32:14.2619640Z D=5120, 2025-05-07T20:32:14.2619737Z scale_ub=1200.0, 2025-05-07T20:32:14.2619827Z contiguous=False, 2025-05-07T20:32:14.2619913Z compiled=False, 2025-05-07T20:32:14.2619998Z ) 2025-05-07T20:32:14.2620225Z self = 2025-05-07T20:32:14.2620408Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:14.2620413Z 2025-05-07T20:32:14.2620499Z @given( 2025-05-07T20:32:14.2620625Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2620730Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2620857Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2620991Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2621164Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2621245Z ) 2025-05-07T20:32:14.2621500Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2621605Z def test_silu_mul_quant( 2025-05-07T20:32:14.2621745Z self, 2025-05-07T20:32:14.2621853Z T: int, 2025-05-07T20:32:14.2621973Z D: int, 2025-05-07T20:32:14.2622110Z scale_ub: Optional[float], 2025-05-07T20:32:14.2622230Z contiguous: bool, 2025-05-07T20:32:14.2622331Z compiled: bool, 2025-05-07T20:32:14.2622418Z ) -> None: 2025-05-07T20:32:14.2622518Z torch.manual_seed(2025) 2025-05-07T20:32:14.2622603Z 2025-05-07T20:32:14.2622799Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2624766Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.2624838Z 2025-05-07T20:32:14.2624965Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:14.2624969Z 2025-05-07T20:32:14.2625086Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2625320Z self=, 2025-05-07T20:32:14.2625403Z T=4096, 2025-05-07T20:32:14.2625505Z D=7168, 2025-05-07T20:32:14.2625627Z scale_ub=1200.0, 2025-05-07T20:32:14.2625753Z contiguous=True, 2025-05-07T20:32:14.2625859Z compiled=False, 2025-05-07T20:32:14.2625935Z ) 2025-05-07T20:32:14.2626164Z self = 2025-05-07T20:32:14.2626373Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:14.2626380Z 2025-05-07T20:32:14.2626492Z @given( 2025-05-07T20:32:14.2626646Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2626752Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2626873Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2627001Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2627122Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2627201Z ) 2025-05-07T20:32:14.2627524Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2627624Z def test_silu_mul_quant( 2025-05-07T20:32:14.2627709Z self, 2025-05-07T20:32:14.2627790Z T: int, 2025-05-07T20:32:14.2627872Z D: int, 2025-05-07T20:32:14.2627980Z scale_ub: Optional[float], 2025-05-07T20:32:14.2628074Z contiguous: bool, 2025-05-07T20:32:14.2628164Z compiled: bool, 2025-05-07T20:32:14.2628251Z ) -> None: 2025-05-07T20:32:14.2628351Z torch.manual_seed(2025) 2025-05-07T20:32:14.2628426Z 2025-05-07T20:32:14.2628610Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2630601Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.2630610Z 2025-05-07T20:32:14.2630792Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:14.2630798Z 2025-05-07T20:32:14.2630909Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2631148Z self=, 2025-05-07T20:32:14.2631270Z T=16384, 2025-05-07T20:32:14.2631352Z D=7168, 2025-05-07T20:32:14.2631444Z scale_ub=None, 2025-05-07T20:32:14.2631536Z contiguous=False, 2025-05-07T20:32:14.2631625Z compiled=True, 2025-05-07T20:32:14.2631709Z ) 2025-05-07T20:32:14.2631936Z self = 2025-05-07T20:32:14.2632121Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:14.2632125Z 2025-05-07T20:32:14.2632217Z @given( 2025-05-07T20:32:14.2632343Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2632456Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2632620Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2632749Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2632882Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2632961Z ) 2025-05-07T20:32:14.2633216Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2633327Z def test_silu_mul_quant( 2025-05-07T20:32:14.2633407Z self, 2025-05-07T20:32:14.2633488Z T: int, 2025-05-07T20:32:14.2633577Z D: int, 2025-05-07T20:32:14.2633681Z scale_ub: Optional[float], 2025-05-07T20:32:14.2633777Z contiguous: bool, 2025-05-07T20:32:14.2633873Z compiled: bool, 2025-05-07T20:32:14.2633957Z ) -> None: 2025-05-07T20:32:14.2634067Z torch.manual_seed(2025) 2025-05-07T20:32:14.2634145Z 2025-05-07T20:32:14.2634320Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2636182Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.2636193Z 2025-05-07T20:32:14.2636320Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:14.2636324Z 2025-05-07T20:32:14.2636441Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2636717Z self=, 2025-05-07T20:32:14.2636803Z T=4096, 2025-05-07T20:32:14.2636893Z D=7168, 2025-05-07T20:32:14.2636980Z scale_ub=None, 2025-05-07T20:32:14.2637073Z contiguous=True, 2025-05-07T20:32:14.2637169Z compiled=False, 2025-05-07T20:32:14.2637247Z ) 2025-05-07T20:32:14.2637481Z self = 2025-05-07T20:32:14.2637660Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:14.2637667Z 2025-05-07T20:32:14.2637768Z @given( 2025-05-07T20:32:14.2637897Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2638004Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2638127Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2644088Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2644234Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2644321Z ) 2025-05-07T20:32:14.2644586Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2644695Z def test_silu_mul_quant( 2025-05-07T20:32:14.2644777Z self, 2025-05-07T20:32:14.2644856Z T: int, 2025-05-07T20:32:14.2644941Z D: int, 2025-05-07T20:32:14.2645125Z scale_ub: Optional[float], 2025-05-07T20:32:14.2645222Z contiguous: bool, 2025-05-07T20:32:14.2645319Z compiled: bool, 2025-05-07T20:32:14.2645402Z ) -> None: 2025-05-07T20:32:14.2645544Z torch.manual_seed(2025) 2025-05-07T20:32:14.2645630Z 2025-05-07T20:32:14.2645807Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2647691Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.2647738Z 2025-05-07T20:32:14.2647864Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:14.2647869Z 2025-05-07T20:32:14.2647977Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2648224Z self=, 2025-05-07T20:32:14.2648309Z T=16384, 2025-05-07T20:32:14.2648398Z D=7168, 2025-05-07T20:32:14.2648486Z scale_ub=None, 2025-05-07T20:32:14.2648585Z contiguous=True, 2025-05-07T20:32:14.2648684Z compiled=False, 2025-05-07T20:32:14.2648764Z ) 2025-05-07T20:32:14.2648991Z self = 2025-05-07T20:32:14.2649186Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:14.2649191Z 2025-05-07T20:32:14.2649273Z @given( 2025-05-07T20:32:14.2649398Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2649514Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2649636Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2649771Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2649889Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2649968Z ) 2025-05-07T20:32:14.2650234Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2650334Z def test_silu_mul_quant( 2025-05-07T20:32:14.2650412Z self, 2025-05-07T20:32:14.2650500Z T: int, 2025-05-07T20:32:14.2650578Z D: int, 2025-05-07T20:32:14.2650681Z scale_ub: Optional[float], 2025-05-07T20:32:14.2650783Z contiguous: bool, 2025-05-07T20:32:14.2650871Z compiled: bool, 2025-05-07T20:32:14.2650997Z ) -> None: 2025-05-07T20:32:14.2651107Z torch.manual_seed(2025) 2025-05-07T20:32:14.2651183Z 2025-05-07T20:32:14.2651364Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2653233Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.2653243Z 2025-05-07T20:32:14.2653379Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:14.2653384Z 2025-05-07T20:32:14.2653494Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2653726Z self=, 2025-05-07T20:32:14.2653813Z T=16384, 2025-05-07T20:32:14.2653893Z D=7168, 2025-05-07T20:32:14.2653979Z scale_ub=1200.0, 2025-05-07T20:32:14.2654070Z contiguous=True, 2025-05-07T20:32:14.2654204Z compiled=False, 2025-05-07T20:32:14.2654281Z ) 2025-05-07T20:32:14.2654510Z self = 2025-05-07T20:32:14.2654692Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:14.2654736Z 2025-05-07T20:32:14.2654821Z @given( 2025-05-07T20:32:14.2654942Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2655042Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2655166Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2655288Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2655408Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2655490Z ) 2025-05-07T20:32:14.2656152Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2656457Z def test_silu_mul_quant( 2025-05-07T20:32:14.2656535Z self, 2025-05-07T20:32:14.2656616Z T: int, 2025-05-07T20:32:14.2656702Z D: int, 2025-05-07T20:32:14.2656806Z scale_ub: Optional[float], 2025-05-07T20:32:14.2656899Z contiguous: bool, 2025-05-07T20:32:14.2656994Z compiled: bool, 2025-05-07T20:32:14.2657077Z ) -> None: 2025-05-07T20:32:14.2657176Z torch.manual_seed(2025) 2025-05-07T20:32:14.2657262Z 2025-05-07T20:32:14.2657439Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2659379Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
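Every example above dies on its very first allocation while the allocator already holds roughly 22 GiB, so the failures cascade from state left behind by earlier examples rather than from any single oversized tensor. Below is a minimal sketch of a per-example cleanup, assuming a unittest-style test class like the ActivationTests shown in this log; the setUp hook and its placement are illustrative, not taken from the FBGEMM sources:

    import unittest

    import torch

    class ActivationTests(unittest.TestCase):
        def setUp(self) -> None:
            # Hand cached allocator segments back to the driver so each
            # Hypothesis example starts from an empty pool instead of the
            # ~22 GiB left pinned by the previous example.
            if torch.cuda.is_available():
                torch.cuda.synchronize()
                torch.cuda.empty_cache()

The allocator hint printed in the error text itself, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, only mitigates fragmentation, and it must be exported before the first CUDA allocation, i.e. in the job environment rather than inside the test body.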
2025-05-07T20:32:14.2659626Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:14.2660954Z     @given(
2025-05-07T20:32:14.2661075Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:14.2661188Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:14.2661306Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:14.2661437Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:14.2661561Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:14.2661637Z     )
2025-05-07T20:32:14.2661897Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:14.2661995Z     def test_silu_mul_quant(
2025-05-07T20:32:14.2662074Z         self,
2025-05-07T20:32:14.2662161Z         T: int,
2025-05-07T20:32:14.2662240Z         D: int,
2025-05-07T20:32:14.2662345Z         scale_ub: Optional[float],
2025-05-07T20:32:14.2662465Z         contiguous: bool,
2025-05-07T20:32:14.2662562Z         compiled: bool,
2025-05-07T20:32:14.2662659Z     ) -> None:
2025-05-07T20:32:14.2662771Z         torch.manual_seed(2025)
2025-05-07T20:32:14.2663114Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:14.2663291Z         x_sign = torch.sign(x)
2025-05-07T20:32:14.2663422Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:14.2663513Z         x = x_sign * x_clamp
2025-05-07T20:32:14.2663666Z         x0 = x[:, :D]
2025-05-07T20:32:14.2663747Z         x1 = x[:, D:]
2025-05-07T20:32:14.2663915Z         if contiguous:
2025-05-07T20:32:14.2664012Z             x0 = x0.contiguous()
2025-05-07T20:32:14.2664106Z             x1 = x1.contiguous()
2025-05-07T20:32:14.2664277Z         if scale_ub is not None:
2025-05-07T20:32:14.2664394Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:14.2664533Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:14.2664608Z             )
2025-05-07T20:32:14.2664732Z         else:
2025-05-07T20:32:14.2664828Z             scale_ub_tensor = None
2025-05-07T20:32:14.2665046Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:14.2665139Z             op = silu_mul_quant
2025-05-07T20:32:14.2665227Z             if compiled:
2025-05-07T20:32:14.2665336Z                 op = torch.compile(op)
2025-05-07T20:32:14.2665446Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:14.2665620Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:14.2665726Z moe/activation_test.py:117:
2025-05-07T20:32:14.2665863Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:14.2665968Z moe/activation_test.py:115: in fn
2025-05-07T20:32:14.2666074Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:14.2666598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:14.2666701Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:14.2667075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:14.2667314Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:14.2667669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:14.2667774Z     kernel = self.compile(
2025-05-07T20:32:14.2668168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:14.2668351Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:14.2668533Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:14.2669569Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:14.2671267Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:14.2671552Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:14.2671668Z                            module_map=module_map)
2025-05-07T20:32:14.2671902Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.2672016Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:14.2672098Z E   ^
2025-05-07T20:32:14.2672494Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.2673061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
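This first non-OOM failure is the real signal of the run: Triton refuses to lower fp8e4nv (FP8 E4M3) because the kernel is compiled for a GPU older than SM 8.9, and the A10G on a linux.g5.4xlarge reports capability (8, 6), which is why only 'fp8e4b15' and 'fp8e5' are offered. A hedged sketch of a capability guard follows; the helper name is ours, while the 8.9 cutoff is Triton's requirement for native E4M3:

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (E4M3) natively only on SM 8.9+ (Ada, Hopper);
        # pre-Ada parts such as the A10G (SM 8.6) raise the ValueError above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

Used as a skip condition (for example, self.skipTest(...) at the top of test_silu_mul_quant), this would turn the hard CompilationError into an explicit skip on pre-SM-8.9 runners.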
2025-05-07T20:32:14.2673174Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:14.2676458Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:14.2678356Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:14.2678490Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:14.2678607Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:14.2684460Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:14.2684565Z moe/activation_test.py:117:
2025-05-07T20:32:14.2685349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:14.2685450Z     return fn(*args, **kwargs)
2025-05-07T20:32:14.2685960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:14.2686076Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:14.2691277Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.2691391Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:14.2691474Z E   ^
2025-05-07T20:32:14.2691839Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.2692278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.2692394Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:14.2696036Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:14.2697931Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:14.2698212Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:14.2698334Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:14.2701940Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:14.2703837Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free.
2025-05-07T20:32:14.2703973Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:14.2704089Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:14.2707405Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:14.2709242Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free.
2025-05-07T20:32:14.2709424Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:14.2709560Z =============================== warnings summary ===============================
2025-05-07T20:32:14.2709880Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
2025-05-07T20:32:14.2710206Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
2025-05-07T20:32:14.2710517Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
2025-05-07T20:32:14.2711432Z   /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
2025-05-07T20:32:14.2711671Z     warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
2025-05-07T20:32:14.2711891Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
2025-05-07T20:32:14.2712068Z ================= 1 failed, 1 deselected, 3 warnings in 21.66s =================
2025-05-07T20:32:15.8709447Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error)
2025-05-07T20:32:15.9326994Z [EXEC] [ATTEMPT 0/2] Command attempt failed.
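One more detail worth reading out of the report above: the free-memory figure shrinks across the run, from 26.44 MiB at the first OOM to 4.44 MiB by the final examples, while PyTorch's allocated total creeps from 21.73 GiB to 21.77 GiB, so each failed example leaks a little more device memory into the next. A sketch of a drift check that would surface this immediately; the helper and the 64 MiB slack are illustrative, not from the test suite:

    import torch

    def assert_no_cuda_growth(baseline_bytes: int, slack_mb: int = 64) -> None:
        # Compare allocator-visible memory against a baseline captured before
        # the example ran; fail fast instead of waiting for a cascade of OOMs.
        grown = torch.cuda.memory_allocated() - baseline_bytes
        assert grown < slack_mb * 1024 * 1024, (
            f"CUDA memory grew by {grown / 2**20:.1f} MiB across one example"
        )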
See " 2025-05-07T20:32:14.2711676Z 2025-05-07T20:32:14.2711891Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:14.2712068Z ================= 1 failed, 1 deselected, 3 warnings in 21.66s ================= 2025-05-07T20:32:15.8709447Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:15.9326994Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:15.9327273Z 2025-05-07T20:32:17.9344991Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:20.0772447Z ============================= test session starts ============================== 2025-05-07T20:32:20.0773106Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:20.0773639Z cachedir: .pytest_cache 2025-05-07T20:32:20.0774225Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:20.0774982Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:20.0775400Z plugins: hypothesis-6.131.14 2025-05-07T20:32:21.6762057Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:21.8538795Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:21.8539231Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:21.8539469Z 2025-05-07T20:32:24.3750966Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.3752346Z self=, 2025-05-07T20:32:24.3752828Z T=1, 2025-05-07T20:32:24.3753032Z D=5120, 2025-05-07T20:32:24.3753231Z scale_ub=None, 2025-05-07T20:32:24.3753460Z contiguous=True, 2025-05-07T20:32:24.3753694Z compiled=True, 2025-05-07T20:32:24.3753907Z ) 2025-05-07T20:32:24.3754251Z self = 2025-05-07T20:32:24.3754753Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:24.3755025Z 2025-05-07T20:32:24.3755232Z @given( 2025-05-07T20:32:24.3755483Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.3756035Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.3756349Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.3756694Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.3757041Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.3757340Z ) 2025-05-07T20:32:24.3757698Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.3758153Z def test_silu_mul_quant( 2025-05-07T20:32:24.3758407Z self, 2025-05-07T20:32:24.3758607Z T: int, 2025-05-07T20:32:24.3758816Z D: int, 2025-05-07T20:32:24.3759047Z scale_ub: Optional[float], 2025-05-07T20:32:24.3759326Z contiguous: bool, 2025-05-07T20:32:24.3759579Z compiled: bool, 2025-05-07T20:32:24.3759818Z ) -> None: 2025-05-07T20:32:24.3760039Z torch.manual_seed(2025) 2025-05-07T20:32:24.3760298Z 2025-05-07T20:32:24.3760586Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.3760936Z 2025-05-07T20:32:24.3761137Z x_sign = torch.sign(x) 2025-05-07T20:32:24.3761438Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:24.3761752Z x = x_sign * x_clamp 2025-05-07T20:32:24.3762007Z x0 = x[:, :D] 2025-05-07T20:32:24.3762234Z x1 = x[:, D:] 2025-05-07T20:32:24.3762449Z 2025-05-07T20:32:24.3762639Z if contiguous: 2025-05-07T20:32:24.3762882Z x0 = x0.contiguous() 2025-05-07T20:32:24.3763153Z x1 = x1.contiguous() 2025-05-07T20:32:24.3763395Z 2025-05-07T20:32:24.3763595Z if scale_ub is not None: 2025-05-07T20:32:24.3763976Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.3764319Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.3764635Z ) 2025-05-07T20:32:24.3764840Z else: 2025-05-07T20:32:24.3765058Z scale_ub_tensor = None 2025-05-07T20:32:24.3765318Z 2025-05-07T20:32:24.3765565Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.3765883Z op = silu_mul_quant 2025-05-07T20:32:24.3766145Z if compiled: 2025-05-07T20:32:24.3766404Z op = torch.compile(op) 2025-05-07T20:32:24.3766711Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.3766995Z 2025-05-07T20:32:24.3767202Z y_fp8, y_scale = fn() 2025-05-07T20:32:24.3767493Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:24.3767791Z 2025-05-07T20:32:24.3768041Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.3768389Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:24.3768691Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:24.3769018Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:24.3769389Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:24.3769705Z 2025-05-07T20:32:24.3770003Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:24.3770205Z 2025-05-07T20:32:24.3770317Z moe/activation_test.py:126: 2025-05-07T20:32:24.3770618Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.3771025Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:24.3771370Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:24.3772183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:24.3772953Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:24.3773523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.3774226Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.3775000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:24.3775740Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:24.3776511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:24.3777281Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:24.3778131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:24.3778792Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:24.3779413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:24.3779944Z fn() 2025-05-07T20:32:24.3780460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:24.3781061Z self.fn.run( 
2025-05-07T20:32:24.3781544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.3782087Z kernel = self.compile( 2025-05-07T20:32:24.3782643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.3783322Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.3783727Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.3783955Z 2025-05-07T20:32:24.3784169Z self = 2025-05-07T20:32:24.3785330Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.3786766Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa89ab60af0>} 2025-05-07T20:32:24.3788139Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.3789187Z context = 2025-05-07T20:32:24.3789485Z 2025-05-07T20:32:24.3789659Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.3790197Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.3790682Z module_map=module_map) 2025-05-07T20:32:24.3791052Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.3791427Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:24.3791706Z E ^ 2025-05-07T20:32:24.3792229Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.3792690Z 2025-05-07T20:32:24.3793116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.3793687Z 2025-05-07T20:32:24.3793798Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.3794225Z self=, 2025-05-07T20:32:24.3794637Z T=2048, 2025-05-07T20:32:24.3794830Z D=5120, 2025-05-07T20:32:24.3795032Z scale_ub=1200.0, 2025-05-07T20:32:24.3795268Z contiguous=True, 2025-05-07T20:32:24.3795498Z compiled=False, 2025-05-07T20:32:24.3795722Z ) 2025-05-07T20:32:25.7364543Z self = 2025-05-07T20:32:25.7366215Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.7366523Z 2025-05-07T20:32:25.7366610Z @given( 2025-05-07T20:32:25.7366853Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.7367174Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.7367507Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.7367849Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.7375904Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.7376244Z ) 2025-05-07T20:32:25.7376614Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.7377070Z def test_silu_mul_quant( 2025-05-07T20:32:25.7377316Z self, 2025-05-07T20:32:25.7377522Z T: int, 2025-05-07T20:32:25.7377733Z D: int, 2025-05-07T20:32:25.7377953Z scale_ub: Optional[float], 2025-05-07T20:32:25.7378313Z contiguous: bool, 2025-05-07T20:32:25.7378569Z compiled: bool, 2025-05-07T20:32:25.7378802Z ) -> None: 2025-05-07T20:32:25.7379029Z torch.manual_seed(2025) 2025-05-07T20:32:25.7379282Z 2025-05-07T20:32:25.7379562Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.7379916Z 
2025-05-07T20:32:25.7380118Z x_sign = torch.sign(x) 2025-05-07T20:32:25.7380414Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.7380730Z x = x_sign * x_clamp 2025-05-07T20:32:25.7380976Z x0 = x[:, :D] 2025-05-07T20:32:25.7381191Z x1 = x[:, D:] 2025-05-07T20:32:25.7381407Z 2025-05-07T20:32:25.7381602Z if contiguous: 2025-05-07T20:32:25.7381833Z x0 = x0.contiguous() 2025-05-07T20:32:25.7382098Z x1 = x1.contiguous() 2025-05-07T20:32:25.7382341Z 2025-05-07T20:32:25.7382695Z if scale_ub is not None: 2025-05-07T20:32:25.7382974Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.7383318Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.7383635Z ) 2025-05-07T20:32:25.7383830Z else: 2025-05-07T20:32:25.7384049Z scale_ub_tensor = None 2025-05-07T20:32:25.7384305Z 2025-05-07T20:32:25.7384543Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.7384866Z op = silu_mul_quant 2025-05-07T20:32:25.7385131Z if compiled: 2025-05-07T20:32:25.7385416Z op = torch.compile(op) 2025-05-07T20:32:25.7385742Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.7386026Z 2025-05-07T20:32:25.7386222Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.7386397Z 2025-05-07T20:32:25.7386501Z moe/activation_test.py:117: 2025-05-07T20:32:25.7386801Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.7387140Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.7387423Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.7388209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.7388923Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.7389465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.7390234Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.7390914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.7391456Z kernel = self.compile( 2025-05-07T20:32:25.7392008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.7392678Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.7393080Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.7393355Z 2025-05-07T20:32:25.7393569Z self = 2025-05-07T20:32:25.7394674Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.7396158Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fa89aa39990>} 2025-05-07T20:32:25.7397574Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.7398618Z context = 2025-05-07T20:32:25.7398907Z 2025-05-07T20:32:25.7399083Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.7399618Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.7400089Z module_map=module_map) 2025-05-07T20:32:25.7400460Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.7400825Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.7401080Z E ^ 2025-05-07T20:32:25.7401553Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.7402009Z 2025-05-07T20:32:25.7402438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.7402955Z 2025-05-07T20:32:25.7403116Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.7403539Z self=, 2025-05-07T20:32:25.7403945Z T=2048, 2025-05-07T20:32:25.7404143Z D=5120, 2025-05-07T20:32:25.7404334Z scale_ub=1200.0, 2025-05-07T20:32:25.7404562Z contiguous=True, 2025-05-07T20:32:25.7404791Z compiled=True, 2025-05-07T20:32:25.7404995Z ) 2025-05-07T20:32:25.7405322Z self = 2025-05-07T20:32:25.7405823Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.7406117Z 2025-05-07T20:32:25.7406203Z @given( 2025-05-07T20:32:25.7406464Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.7406782Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.7407095Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.7407424Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.7407763Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.7408052Z ) 2025-05-07T20:32:25.7408403Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.7408855Z def test_silu_mul_quant( 2025-05-07T20:32:25.7409102Z self, 2025-05-07T20:32:25.7409339Z T: int, 2025-05-07T20:32:25.7409540Z D: int, 2025-05-07T20:32:25.7409761Z scale_ub: Optional[float], 2025-05-07T20:32:25.7410033Z contiguous: bool, 2025-05-07T20:32:25.7410276Z compiled: bool, 2025-05-07T20:32:25.7410543Z ) -> None: 2025-05-07T20:32:25.7410756Z torch.manual_seed(2025) 2025-05-07T20:32:25.7410999Z 2025-05-07T20:32:25.7411278Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.7411620Z 2025-05-07T20:32:25.7411812Z x_sign = torch.sign(x) 2025-05-07T20:32:25.7412111Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.7412422Z x = x_sign * x_clamp 2025-05-07T20:32:25.7412660Z x0 = x[:, :D] 2025-05-07T20:32:25.7412880Z x1 = x[:, D:] 2025-05-07T20:32:25.7413090Z 2025-05-07T20:32:25.7413275Z if contiguous: 2025-05-07T20:32:25.7413555Z x0 = x0.contiguous() 2025-05-07T20:32:25.7413813Z x1 = x1.contiguous() 2025-05-07T20:32:25.7414049Z 2025-05-07T20:32:25.7414245Z if scale_ub is not None: 2025-05-07T20:32:25.7414524Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.7414858Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.7415172Z ) 2025-05-07T20:32:25.7415367Z else: 2025-05-07T20:32:25.7415573Z scale_ub_tensor = None 2025-05-07T20:32:25.7415829Z 2025-05-07T20:32:25.7416068Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.7416385Z op = silu_mul_quant 2025-05-07T20:32:25.7416632Z if compiled: 
2025-05-07T20:32:25.7416886Z                 op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa8996d96c0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:25.7443855Z Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
2025-05-07T20:32:26.9387052Z self =
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
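The failure is environmental rather than shape-dependent: fp8e4nv is Triton's name for the FP8 E4M3 encoding, which its NVIDIA backend only lowers on compute capability 8.9+ (Ada/Hopper), while this job's linux.g5.4xlarge runner carries an A10G (sm_86), hence the message offering only 'fp8e4b15' and 'fp8e5'. A minimal capability probe, as a sketch; the helper name supports_fp8e4nv is hypothetical and not part of the test file:

    import torch

    def supports_fp8e4nv() -> bool:
        # FP8 E4M3 ("fp8e4nv") Triton kernels need compute capability >= 8.9.
        # The A10G on g5 instances reports (8, 6), so this returns False there.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)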
2025-05-07T20:32:26.9417754Z Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:32:26.9420075Z self =
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

    [test source identical to the listing above]

Here fn() completes and the same error surfaces in the reference path instead:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
    [jit/autotuner/compiler frames identical to the first traceback above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
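So with compiled=True the fused op returns and only the row-wise quantizer trips the unsupported cast; both paths hinge on the same fp8e4nv conversion. As a reading aid, here is a pure-PyTorch sketch of the row-wise contract the test dequantizes against (y ~ y_fp8.to(torch.float32) * y_scale[:, None]); the max-abs/scale_ub formula is an assumption about triton_quantize_fp8_row, not taken from this log:

    import torch

    def quantize_fp8_row_sketch(y, scale_ub=None):
        # Assumed contract: y ~ y_fp8.to(torch.float32) * scale[:, None].
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
        row_max = y.abs().amax(dim=-1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # scale_ub: 1-element tensor
        scale = row_max / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale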
2025-05-07T20:32:28.5337524Z op = torch.compile(op) 2025-05-07T20:32:28.5337830Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.5338230Z 2025-05-07T20:32:28.5338437Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.5338604Z 2025-05-07T20:32:28.5338714Z moe/activation_test.py:117: 2025-05-07T20:32:28.5339009Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.5339346Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.5339638Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.5340346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.5341134Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.5341687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.5342534Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.5343212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.5343760Z kernel = self.compile( 2025-05-07T20:32:28.5344317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.5344991Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.5345389Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.5345625Z 2025-05-07T20:32:28.5345838Z self = 2025-05-07T20:32:28.5346946Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.5348355Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa89937d7e0>} 2025-05-07T20:32:28.5349732Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.5350779Z context = 2025-05-07T20:32:28.5351135Z 2025-05-07T20:32:28.5351305Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.5351841Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.5352321Z module_map=module_map) 2025-05-07T20:32:28.5352691Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.5353184Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.5353495Z E ^ 2025-05-07T20:32:28.5354060Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.5354518Z 2025-05-07T20:32:28.5354945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.5355463Z 2025-05-07T20:32:28.5355837Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.5356269Z self=, 2025-05-07T20:32:28.5356674Z T=4096, 2025-05-07T20:32:28.5356866Z D=7168, 2025-05-07T20:32:28.5357051Z scale_ub=None, 2025-05-07T20:32:28.5357273Z contiguous=False, 2025-05-07T20:32:28.5357504Z compiled=False, 2025-05-07T20:32:28.5357704Z ) 2025-05-07T20:32:28.5358139Z self = 2025-05-07T20:32:28.5358645Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:28.5358918Z 2025-05-07T20:32:28.5359055Z @given( 2025-05-07T20:32:28.5359288Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.5359603Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.5359907Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.5360242Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.5360577Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.5360868Z ) 2025-05-07T20:32:28.5361223Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.5361675Z def test_silu_mul_quant( 2025-05-07T20:32:28.5361988Z self, 2025-05-07T20:32:28.5362179Z T: int, 2025-05-07T20:32:28.5362377Z D: int, 2025-05-07T20:32:28.5362604Z scale_ub: Optional[float], 2025-05-07T20:32:28.5362872Z contiguous: bool, 2025-05-07T20:32:28.5363115Z compiled: bool, 2025-05-07T20:32:28.5363343Z ) -> None: 2025-05-07T20:32:28.5363557Z torch.manual_seed(2025) 2025-05-07T20:32:28.5363805Z 2025-05-07T20:32:28.5364086Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.5364425Z 2025-05-07T20:32:28.5364623Z x_sign = torch.sign(x) 2025-05-07T20:32:28.5364918Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.5365226Z x = x_sign * x_clamp 2025-05-07T20:32:28.5365467Z x0 = x[:, :D] 2025-05-07T20:32:28.5365683Z x1 = x[:, D:] 2025-05-07T20:32:28.5365895Z 2025-05-07T20:32:28.5366077Z if contiguous: 2025-05-07T20:32:28.5366308Z x0 = x0.contiguous() 2025-05-07T20:32:28.5366569Z x1 = x1.contiguous() 2025-05-07T20:32:28.5366803Z 2025-05-07T20:32:28.5366996Z if scale_ub is not None: 2025-05-07T20:32:28.5367271Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.5367605Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.5367916Z ) 2025-05-07T20:32:28.5368116Z else: 2025-05-07T20:32:28.5368322Z scale_ub_tensor = None 2025-05-07T20:32:28.5368576Z 2025-05-07T20:32:28.5368815Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.5369125Z op = silu_mul_quant 2025-05-07T20:32:28.5369375Z if compiled: 2025-05-07T20:32:28.5369623Z op = torch.compile(op) 2025-05-07T20:32:28.5369916Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.5370260Z 2025-05-07T20:32:28.5370458Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.5370625Z 2025-05-07T20:32:28.5370729Z moe/activation_test.py:117: 2025-05-07T20:32:28.5371024Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.5371356Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.5371645Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.5372341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.5373043Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.5373587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.5374278Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.5374942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.5375488Z kernel = self.compile( 2025-05-07T20:32:28.5376043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.5376705Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.5377149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.5377384Z 2025-05-07T20:32:28.5377594Z self = 2025-05-07T20:32:28.5378809Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.5380197Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa89937dfc0>} 2025-05-07T20:32:28.5381569Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.5382657Z context = 2025-05-07T20:32:28.5382946Z 2025-05-07T20:32:28.5383122Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.5383656Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.5384131Z module_map=module_map) 2025-05-07T20:32:28.5384501Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.5384863Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.5385122Z E ^ 2025-05-07T20:32:28.5385595Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.5386052Z 2025-05-07T20:32:28.5386506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.5387050Z 2025-05-07T20:32:28.5387163Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.5387581Z self=, 2025-05-07T20:32:28.5387986Z T=128, 2025-05-07T20:32:28.5388176Z D=7168, 2025-05-07T20:32:28.5388371Z scale_ub=None, 2025-05-07T20:32:28.5388587Z contiguous=False, 2025-05-07T20:32:28.5388819Z compiled=True, 2025-05-07T20:32:28.5389015Z ) 2025-05-07T20:32:28.6003292Z self = 2025-05-07T20:32:28.6003957Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:28.6004242Z 2025-05-07T20:32:28.6004321Z @given( 2025-05-07T20:32:28.6004563Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6005001Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6005321Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6005659Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6005995Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6006288Z ) 2025-05-07T20:32:28.6006648Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6007091Z def test_silu_mul_quant( 2025-05-07T20:32:28.6007337Z self, 2025-05-07T20:32:28.6007537Z T: int, 2025-05-07T20:32:28.6007734Z D: int, 2025-05-07T20:32:28.6007960Z scale_ub: Optional[float], 2025-05-07T20:32:28.6008242Z contiguous: bool, 2025-05-07T20:32:28.6008489Z compiled: bool, 2025-05-07T20:32:28.6008715Z ) -> None: 2025-05-07T20:32:28.6008938Z torch.manual_seed(2025) 2025-05-07T20:32:28.6009186Z 2025-05-07T20:32:28.6009463Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6009822Z 2025-05-07T20:32:28.6010024Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6010319Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6010636Z x = x_sign * x_clamp 2025-05-07T20:32:28.6010881Z x0 = x[:, :D] 2025-05-07T20:32:28.6011168Z x1 = x[:, D:] 2025-05-07T20:32:28.6011385Z 2025-05-07T20:32:28.6011579Z if contiguous: 2025-05-07T20:32:28.6011810Z x0 = x0.contiguous() 2025-05-07T20:32:28.6012077Z x1 = x1.contiguous() 2025-05-07T20:32:28.6012407Z 2025-05-07T20:32:28.6012599Z if scale_ub is not None: 2025-05-07T20:32:28.6012882Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6013225Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6013537Z ) 2025-05-07T20:32:28.6013733Z else: 2025-05-07T20:32:28.6013949Z scale_ub_tensor = None 2025-05-07T20:32:28.6014215Z 2025-05-07T20:32:28.6014452Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6014773Z op = silu_mul_quant 2025-05-07T20:32:28.6015031Z if compiled: 2025-05-07T20:32:28.6015349Z op = torch.compile(op) 2025-05-07T20:32:28.6015657Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6015936Z 2025-05-07T20:32:28.6016142Z y_fp8, y_scale = fn() 2025-05-07T20:32:28.6016429Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:28.6016724Z 2025-05-07T20:32:28.6016999Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6017455Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:28.6017762Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:28.6018168Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:28.6018533Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.6018853Z 2025-05-07T20:32:28.6019071Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:28.6019272Z 2025-05-07T20:32:28.6019381Z moe/activation_test.py:126: 2025-05-07T20:32:28.6019682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6020025Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:28.6020367Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.6021170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:28.6021946Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:28.6022509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6023210Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6023911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:28.6024713Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.6025490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:28.6026256Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.6027002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:28.6027663Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:28.6028281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:28.6028807Z fn() 2025-05-07T20:32:28.6029332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:28.6029926Z self.fn.run( 2025-05-07T20:32:28.6030415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6030955Z kernel = self.compile( 2025-05-07T20:32:28.6031559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6032234Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6032635Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6032911Z 2025-05-07T20:32:28.6033123Z self = 2025-05-07T20:32:28.6034229Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6035638Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fa89939a560>} 2025-05-07T20:32:28.6037016Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6038105Z context = 2025-05-07T20:32:28.6038405Z 2025-05-07T20:32:28.6038578Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6039117Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6039604Z module_map=module_map) 2025-05-07T20:32:28.6039978Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6040354Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:28.6040632Z E ^ 2025-05-07T20:32:28.6041109Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6041582Z 2025-05-07T20:32:28.6042007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6042545Z 2025-05-07T20:32:28.6042655Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6043085Z self=, 2025-05-07T20:32:28.6043492Z T=128, 2025-05-07T20:32:28.6043694Z D=7168, 2025-05-07T20:32:28.6043898Z scale_ub=None, 2025-05-07T20:32:28.6044119Z contiguous=False, 2025-05-07T20:32:28.6044357Z compiled=False, 2025-05-07T20:32:28.6044575Z ) 2025-05-07T20:32:28.9611148Z self = 2025-05-07T20:32:28.9612498Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:28.9613058Z 2025-05-07T20:32:28.9613217Z @given( 2025-05-07T20:32:28.9613908Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.9614543Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.9615172Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.9615856Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.9616319Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.9616641Z ) 2025-05-07T20:32:28.9617003Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.9617457Z def test_silu_mul_quant( 2025-05-07T20:32:28.9617706Z self, 2025-05-07T20:32:28.9617910Z T: int, 2025-05-07T20:32:28.9618216Z D: int, 2025-05-07T20:32:28.9618446Z scale_ub: Optional[float], 2025-05-07T20:32:28.9618729Z contiguous: bool, 2025-05-07T20:32:28.9618979Z compiled: bool, 2025-05-07T20:32:28.9619207Z ) -> None: 2025-05-07T20:32:28.9619435Z torch.manual_seed(2025) 2025-05-07T20:32:28.9619691Z 2025-05-07T20:32:28.9619972Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.9620328Z 2025-05-07T20:32:28.9620534Z x_sign = torch.sign(x) 2025-05-07T20:32:28.9620833Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.9621220Z x = x_sign * x_clamp 2025-05-07T20:32:28.9621471Z x0 = x[:, :D] 2025-05-07T20:32:28.9621692Z x1 = x[:, D:] 2025-05-07T20:32:28.9621907Z 2025-05-07T20:32:28.9622105Z if contiguous: 2025-05-07T20:32:28.9622398Z x0 = x0.contiguous() 2025-05-07T20:32:28.9622667Z x1 = x1.contiguous() 2025-05-07T20:32:28.9622917Z 2025-05-07T20:32:28.9623113Z if scale_ub is not None: 2025-05-07T20:32:28.9623395Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.9623742Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.9624054Z ) 2025-05-07T20:32:28.9624254Z else: 2025-05-07T20:32:28.9624475Z scale_ub_tensor = None 2025-05-07T20:32:28.9624735Z 2025-05-07T20:32:28.9624969Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.9625367Z op = silu_mul_quant 2025-05-07T20:32:28.9625623Z if compiled: 
2025-05-07T20:32:28.9625875Z op = torch.compile(op) 2025-05-07T20:32:28.9626182Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.9626504Z 2025-05-07T20:32:28.9626709Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.9626888Z 2025-05-07T20:32:28.9626995Z moe/activation_test.py:117: 2025-05-07T20:32:28.9627302Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.9627640Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.9627933Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.9628644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.9629360Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.9629910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.9630614Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.9631293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.9631837Z kernel = self.compile( 2025-05-07T20:32:28.9632401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.9633076Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.9633482Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.9633710Z 2025-05-07T20:32:28.9633925Z self = 2025-05-07T20:32:28.9635082Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.9636503Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa8993ed7e0>} 2025-05-07T20:32:28.9637883Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.9638940Z context = 2025-05-07T20:32:28.9639234Z 2025-05-07T20:32:28.9639403Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.9639942Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.9640426Z module_map=module_map) 2025-05-07T20:32:28.9640794Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.9641165Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.9641435Z E ^ 2025-05-07T20:32:28.9641960Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.9642423Z 2025-05-07T20:32:28.9642849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.9643423Z 2025-05-07T20:32:28.9643532Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.9643957Z self=, 2025-05-07T20:32:28.9644367Z T=4096, 2025-05-07T20:32:28.9644559Z D=5120, 2025-05-07T20:32:28.9644757Z scale_ub=1200.0, 2025-05-07T20:32:28.9644989Z contiguous=True, 2025-05-07T20:32:28.9645218Z compiled=False, 2025-05-07T20:32:28.9645431Z ) 2025-05-07T20:32:28.9645765Z self = 2025-05-07T20:32:28.9646344Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:28.9646652Z 2025-05-07T20:32:28.9646731Z @given( 2025-05-07T20:32:28.9646967Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.9647282Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.9647603Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.9647940Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.9648278Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.9648566Z ) 2025-05-07T20:32:28.9648924Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.9649374Z def test_silu_mul_quant( 2025-05-07T20:32:28.9649616Z self, 2025-05-07T20:32:28.9649821Z T: int, 2025-05-07T20:32:28.9650028Z D: int, 2025-05-07T20:32:28.9650250Z scale_ub: Optional[float], 2025-05-07T20:32:28.9650529Z contiguous: bool, 2025-05-07T20:32:28.9650780Z compiled: bool, 2025-05-07T20:32:28.9651004Z ) -> None: 2025-05-07T20:32:28.9651230Z torch.manual_seed(2025) 2025-05-07T20:32:28.9651479Z 2025-05-07T20:32:28.9651755Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.9652104Z 2025-05-07T20:32:28.9652307Z x_sign = torch.sign(x) 2025-05-07T20:32:28.9652603Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.9652917Z x = x_sign * x_clamp 2025-05-07T20:32:28.9653165Z x0 = x[:, :D] 2025-05-07T20:32:28.9653386Z x1 = x[:, D:] 2025-05-07T20:32:28.9653595Z 2025-05-07T20:32:28.9653789Z if contiguous: 2025-05-07T20:32:28.9654024Z x0 = x0.contiguous() 2025-05-07T20:32:28.9654281Z x1 = x1.contiguous() 2025-05-07T20:32:28.9654576Z 2025-05-07T20:32:28.9654778Z if scale_ub is not None: 2025-05-07T20:32:28.9655060Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.9655408Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.9655899Z ) 2025-05-07T20:32:28.9656097Z else: 2025-05-07T20:32:28.9656319Z scale_ub_tensor = None 2025-05-07T20:32:28.9656579Z 2025-05-07T20:32:28.9656817Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.9663126Z op = silu_mul_quant 2025-05-07T20:32:28.9663421Z if compiled: 2025-05-07T20:32:28.9663684Z op = torch.compile(op) 2025-05-07T20:32:28.9663981Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.9664256Z 2025-05-07T20:32:28.9664448Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.9664614Z 2025-05-07T20:32:28.9664714Z moe/activation_test.py:117: 2025-05-07T20:32:28.9665010Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.9665337Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.9665629Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.9666430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.9667134Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.9667678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.9668421Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.9669100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.9669644Z kernel = self.compile( 2025-05-07T20:32:28.9670189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.9670852Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.9671251Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.9671542Z 2025-05-07T20:32:28.9671758Z self = 2025-05-07T20:32:28.9672849Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.9674242Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa8993edf30>} 2025-05-07T20:32:28.9675609Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.9676655Z context = 2025-05-07T20:32:28.9676945Z 2025-05-07T20:32:28.9677121Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.9677648Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.9678125Z module_map=module_map) 2025-05-07T20:32:28.9678498Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.9678868Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.9679129Z E ^ 2025-05-07T20:32:28.9679604Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.9680058Z 2025-05-07T20:32:28.9680484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.9681000Z 2025-05-07T20:32:28.9681171Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.9681595Z self=, 2025-05-07T20:32:28.9682005Z T=1, 2025-05-07T20:32:28.9682193Z D=5120, 2025-05-07T20:32:28.9682384Z scale_ub=None, 2025-05-07T20:32:28.9682607Z contiguous=True, 2025-05-07T20:32:28.9682837Z compiled=True, 2025-05-07T20:32:28.9683039Z ) 2025-05-07T20:32:29.5451938Z self = 2025-05-07T20:32:29.5453247Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:29.5453789Z 2025-05-07T20:32:29.5453950Z @given( 2025-05-07T20:32:29.5454424Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5455048Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5455963Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5456441Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5456779Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5457073Z ) 2025-05-07T20:32:29.5457437Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5457891Z def test_silu_mul_quant( 2025-05-07T20:32:29.5458218Z self, 2025-05-07T20:32:29.5458537Z T: int, 2025-05-07T20:32:29.5458743Z D: int, 2025-05-07T20:32:29.5458963Z scale_ub: Optional[float], 2025-05-07T20:32:29.5459244Z contiguous: bool, 2025-05-07T20:32:29.5459549Z compiled: bool, 2025-05-07T20:32:29.5459775Z ) -> None: 2025-05-07T20:32:29.5459997Z torch.manual_seed(2025) 2025-05-07T20:32:29.5460246Z 2025-05-07T20:32:29.5460523Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5460873Z 2025-05-07T20:32:29.5461073Z x_sign = torch.sign(x) 2025-05-07T20:32:29.5461366Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.5461686Z x = x_sign * x_clamp 2025-05-07T20:32:29.5461931Z x0 = x[:, :D] 2025-05-07T20:32:29.5462146Z x1 = x[:, D:] 2025-05-07T20:32:29.5462358Z 2025-05-07T20:32:29.5462622Z if contiguous: 2025-05-07T20:32:29.5462855Z x0 = x0.contiguous() 2025-05-07T20:32:29.5463123Z x1 = x1.contiguous() 2025-05-07T20:32:29.5463370Z 2025-05-07T20:32:29.5463573Z if scale_ub is not None: 2025-05-07T20:32:29.5463850Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.5464195Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.5464512Z ) 2025-05-07T20:32:29.5464705Z else: 2025-05-07T20:32:29.5464920Z scale_ub_tensor = None 2025-05-07T20:32:29.5465180Z 2025-05-07T20:32:29.5465417Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.5465740Z op = silu_mul_quant 2025-05-07T20:32:29.5465995Z if compiled: 2025-05-07T20:32:29.5466245Z op = torch.compile(op) 2025-05-07T20:32:29.5466553Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.5466834Z 2025-05-07T20:32:29.5467033Z y_fp8, y_scale = fn() 2025-05-07T20:32:29.5467330Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:29.5467627Z 2025-05-07T20:32:29.5467876Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.5468213Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:29.5468512Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:29.5468836Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:29.5469198Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.5469515Z 2025-05-07T20:32:29.5469724Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:29.5469926Z 2025-05-07T20:32:29.5470030Z moe/activation_test.py:126: 2025-05-07T20:32:29.5470329Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.5470737Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:29.5471076Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.5471885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:29.5472657Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:29.5473218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.5473912Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.5474615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:29.5475357Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:29.5476123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:29.5476931Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:29.5477725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:29.5478379Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:29.5478991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:29.5479554Z fn() 2025-05-07T20:32:29.5480074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:29.5480667Z self.fn.run( 2025-05-07T20:32:29.5481143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.5481689Z kernel = self.compile( 2025-05-07T20:32:29.5482246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.5482912Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.5483353Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.5483590Z 2025-05-07T20:32:29.5483805Z self = 2025-05-07T20:32:29.5484912Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.5486324Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fa8993ef0a0>} 2025-05-07T20:32:29.5487693Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.5488739Z context = 2025-05-07T20:32:29.5489038Z 2025-05-07T20:32:29.5489212Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.5489747Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.5490220Z module_map=module_map) 2025-05-07T20:32:29.5490597Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.5490964Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:29.5491242Z E ^ 2025-05-07T20:32:29.5491712Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.5492175Z 2025-05-07T20:32:29.5492642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.5493163Z 2025-05-07T20:32:29.5493275Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5493696Z self=, 2025-05-07T20:32:29.5494104Z T=2048, 2025-05-07T20:32:29.5494300Z D=5120, 2025-05-07T20:32:29.5494496Z scale_ub=None, 2025-05-07T20:32:29.5494711Z contiguous=True, 2025-05-07T20:32:29.5494937Z compiled=True, 2025-05-07T20:32:29.5495142Z ) 2025-05-07T20:32:30.0850565Z self = 2025-05-07T20:32:30.0851245Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:30.0851524Z 2025-05-07T20:32:30.0851612Z @given( 2025-05-07T20:32:30.0851854Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.0852176Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.0852502Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.0852845Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.0853182Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.0853478Z ) 2025-05-07T20:32:30.0853960Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.0854420Z def test_silu_mul_quant( 2025-05-07T20:32:30.0854671Z self, 2025-05-07T20:32:30.0854875Z T: int, 2025-05-07T20:32:30.0855072Z D: int, 2025-05-07T20:32:30.0855371Z scale_ub: Optional[float], 2025-05-07T20:32:30.0855997Z contiguous: bool, 2025-05-07T20:32:30.0856244Z compiled: bool, 2025-05-07T20:32:30.0856479Z ) -> None: 2025-05-07T20:32:30.0856705Z torch.manual_seed(2025) 2025-05-07T20:32:30.0856958Z 2025-05-07T20:32:30.0857238Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.0857592Z 2025-05-07T20:32:30.0857795Z x_sign = torch.sign(x) 2025-05-07T20:32:30.0858215Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.0858534Z x = x_sign * x_clamp 2025-05-07T20:32:30.0858783Z x0 = x[:, :D] 2025-05-07T20:32:30.0859132Z x1 = x[:, D:] 2025-05-07T20:32:30.0859349Z 2025-05-07T20:32:30.0859557Z if contiguous: 2025-05-07T20:32:30.0859793Z x0 = x0.contiguous() 2025-05-07T20:32:30.0860067Z x1 = x1.contiguous() 2025-05-07T20:32:30.0860317Z 2025-05-07T20:32:30.0860513Z if scale_ub is not None: 2025-05-07T20:32:30.0860800Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.0861153Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.0861464Z ) 2025-05-07T20:32:30.0861661Z else: 2025-05-07T20:32:30.0861876Z scale_ub_tensor = None 2025-05-07T20:32:30.0862133Z 2025-05-07T20:32:30.0862383Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.0862719Z op = silu_mul_quant 2025-05-07T20:32:30.0862973Z if compiled: 
2025-05-07T20:32:30.0863230Z op = torch.compile(op) 2025-05-07T20:32:30.0863542Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.0863820Z 2025-05-07T20:32:30.0864020Z y_fp8, y_scale = fn() 2025-05-07T20:32:30.0864319Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:30.0864614Z 2025-05-07T20:32:30.0864862Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.0865214Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:30.0865513Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:30.0865840Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:30.0866215Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:30.0866574Z 2025-05-07T20:32:30.0866801Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:30.0867012Z 2025-05-07T20:32:30.0867117Z moe/activation_test.py:126: 2025-05-07T20:32:30.0867504Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.0867847Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:30.0868193Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:30.0869016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:30.0869791Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:30.0870357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.0871065Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.0871780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:30.0872527Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:30.0873308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:30.0874082Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:30.0874906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:30.0875569Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:30.0876253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:30.0876839Z fn() 2025-05-07T20:32:30.0877367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:30.0877973Z self.fn.run( 2025-05-07T20:32:30.0878459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.0879012Z kernel = self.compile( 2025-05-07T20:32:30.0879569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.0880299Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.0880713Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.0880948Z 2025-05-07T20:32:30.0881166Z self = 2025-05-07T20:32:30.0882282Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:30.0883707Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa898eabbe0>} 2025-05-07T20:32:30.0885103Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.0886168Z context = 2025-05-07T20:32:30.0886466Z 2025-05-07T20:32:30.0886641Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.0887228Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.0887718Z module_map=module_map) 2025-05-07T20:32:30.0888096Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.0888465Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:30.0888747Z E ^ 2025-05-07T20:32:30.0889228Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.0889741Z 2025-05-07T20:32:30.0890171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.0890706Z 2025-05-07T20:32:30.0890817Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.0891248Z self=, 2025-05-07T20:32:30.0891663Z T=128, 2025-05-07T20:32:30.0891856Z D=5120, 2025-05-07T20:32:30.0892061Z scale_ub=None, 2025-05-07T20:32:30.0892290Z contiguous=True, 2025-05-07T20:32:30.0892524Z compiled=True, 2025-05-07T20:32:30.0892744Z ) 2025-05-07T20:32:30.9759418Z self = 2025-05-07T20:32:30.9760104Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:30.9760400Z 2025-05-07T20:32:30.9760485Z @given( 2025-05-07T20:32:30.9760738Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.9761068Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.9761387Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.9761735Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.9762082Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.9762502Z ) 2025-05-07T20:32:30.9762874Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.9763333Z def test_silu_mul_quant( 2025-05-07T20:32:30.9763584Z self, 2025-05-07T20:32:30.9763852Z T: int, 2025-05-07T20:32:30.9764057Z D: int, 2025-05-07T20:32:30.9764283Z scale_ub: Optional[float], 2025-05-07T20:32:30.9764568Z contiguous: bool, 2025-05-07T20:32:30.9764816Z compiled: bool, 2025-05-07T20:32:30.9765048Z ) -> None: 2025-05-07T20:32:30.9765273Z torch.manual_seed(2025) 2025-05-07T20:32:30.9765529Z 2025-05-07T20:32:30.9765817Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.9766176Z 2025-05-07T20:32:30.9766386Z x_sign = torch.sign(x) 2025-05-07T20:32:30.9766802Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.9767252Z x = x_sign * x_clamp 2025-05-07T20:32:30.9767504Z x0 = x[:, :D] 2025-05-07T20:32:30.9767730Z x1 = x[:, D:] 2025-05-07T20:32:30.9767946Z 2025-05-07T20:32:30.9768143Z if contiguous: 2025-05-07T20:32:30.9768381Z x0 = x0.contiguous() 2025-05-07T20:32:30.9768652Z x1 = x1.contiguous() 2025-05-07T20:32:30.9768918Z 2025-05-07T20:32:30.9769115Z if scale_ub is not None: 2025-05-07T20:32:30.9769404Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.9769754Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.9770070Z ) 2025-05-07T20:32:30.9770266Z else: 2025-05-07T20:32:30.9770485Z scale_ub_tensor = None 2025-05-07T20:32:30.9770751Z 2025-05-07T20:32:30.9770995Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:32:30.9800063Z Trying example: test_silu_mul_quant( self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True, )
[elided: test source and the identical _kernel_quantize_fp8_row CompilationError traceback]
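For context on the reference path that keeps failing here: triton_quantize_fp8_row derives one scale per row from that row's max absolute value and casts the row to FP8. A rough pure-PyTorch emulation, assuming E4M3 output and scale_ub acting as a cap on the per-row max (the actual FBGEMM kernel may differ in eps handling and scale_ub semantics):

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_sketch(
        x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
        row_max = x.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            # Assumption: scale_ub caps the row max before the scale is derived.
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=1e-12) / fp8_max
        x_fp8 = (x.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return x_fp8, scale

This matches how the test dequantizes: y_fp8.to(torch.float32) * y_scale[:, None].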
2025-05-07T20:32:31.7387437Z Trying example: test_silu_mul_quant( self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True, )
2025-05-07T20:32:31.7775372Z W0507 20:32:31.775000 87841 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:31.7776899Z W0507 20:32:31.775000 87841 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:31.7778447Z W0507 20:32:31.775000 87841 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:31.7779451Z W0507 20:32:31.775000 87841 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:31.7780674Z W0507 20:32:31.775000 87841 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
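The recompile_limit warning above is a side effect rather than the failure itself: with compiled=True, each new combination of shape and stride (contiguous copies vs. strided slices of x) fails a guard such as the 'x0' stride mismatch reported, and after eight recompiles dynamo stops compiling silu_mul_quant and falls back to eager. A sketch of raising the budget and pre-declaring the varying dimension (recompile_limit is the config name from the warning; mark_dynamic is a standard torch._dynamo helper; x0 here is a stand-in for the test's activation slice):

    import torch

    # Raise the per-function recompile budget (the warning shows the default, 8).
    torch._dynamo.config.recompile_limit = 32

    # Or avoid one recompile per new T by marking the token dim dynamic.
    x0 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
    torch._dynamo.mark_dynamic(x0, 0)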
2025-05-07T20:32:31.8811379Z self = 2025-05-07T20:32:31.8812146Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:31.8812543Z 2025-05-07T20:32:31.8812652Z @given( 2025-05-07T20:32:31.8812899Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:31.8813368Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:31.8813692Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:31.8814038Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:31.8814374Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:31.8814681Z ) 2025-05-07T20:32:31.8815049Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:31.8815497Z def test_silu_mul_quant( 2025-05-07T20:32:31.8815751Z self, 2025-05-07T20:32:31.8815957Z T: int, 2025-05-07T20:32:31.8816161Z D: int, 2025-05-07T20:32:31.8816391Z scale_ub: Optional[float], 2025-05-07T20:32:31.8816677Z contiguous: bool, 2025-05-07T20:32:31.8816949Z compiled: bool, 2025-05-07T20:32:31.8817212Z ) -> None: 2025-05-07T20:32:31.8817442Z torch.manual_seed(2025) 2025-05-07T20:32:31.8817690Z 2025-05-07T20:32:31.8818077Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:31.8818432Z 2025-05-07T20:32:31.8818638Z x_sign = torch.sign(x) 2025-05-07T20:32:31.8818939Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:31.8819258Z x = x_sign * x_clamp 2025-05-07T20:32:31.8819510Z x0 = x[:, :D] 2025-05-07T20:32:31.8819735Z x1 = x[:, D:] 2025-05-07T20:32:31.8819951Z 2025-05-07T20:32:31.8820146Z if contiguous: 2025-05-07T20:32:31.8820385Z x0 = x0.contiguous() 2025-05-07T20:32:31.8820654Z x1 = x1.contiguous() 2025-05-07T20:32:31.8820903Z 2025-05-07T20:32:31.8821101Z if scale_ub is not None: 2025-05-07T20:32:31.8821387Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:31.8821818Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:31.8822136Z ) 2025-05-07T20:32:31.8822337Z else: 2025-05-07T20:32:31.8822561Z scale_ub_tensor = None 2025-05-07T20:32:31.8822820Z 2025-05-07T20:32:31.8823063Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.8823391Z op = silu_mul_quant 2025-05-07T20:32:31.8823649Z if compiled: 2025-05-07T20:32:31.8823901Z op = torch.compile(op) 2025-05-07T20:32:31.8824212Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.8824498Z 2025-05-07T20:32:31.8824697Z y_fp8, y_scale = fn() 2025-05-07T20:32:31.8824992Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:31.8825288Z 2025-05-07T20:32:31.8825534Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.8825882Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:31.8826188Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:31.8826512Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:31.8826881Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:31.8827199Z 2025-05-07T20:32:31.8827404Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:31.8827608Z 2025-05-07T20:32:31.8827796Z moe/activation_test.py:126: 2025-05-07T20:32:31.8828100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.8828442Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:31.8828834Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:31.8829636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:31.8830407Z 
_kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:31.8830962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:31.8831660Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:31.8832362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:31.8833145Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:31.8833911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:31.8834675Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:31.8835417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:31.8836073Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:31.8836682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:31.8837218Z fn() 2025-05-07T20:32:31.8837742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:31.8838338Z self.fn.run( 2025-05-07T20:32:31.8838818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:31.8839365Z kernel = self.compile( 2025-05-07T20:32:31.8839921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:31.8840587Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.8840992Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.8841224Z 2025-05-07T20:32:31.8841444Z self = 2025-05-07T20:32:31.8842593Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:31.8844000Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa89806df30>} 2025-05-07T20:32:31.8845369Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:31.8846420Z context = 2025-05-07T20:32:31.8846719Z 2025-05-07T20:32:31.8846895Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:31.8847424Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.8847909Z module_map=module_map) 2025-05-07T20:32:31.8848293Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.8848664Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:31.8848938Z E ^ 2025-05-07T20:32:31.8849458Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.8849920Z 2025-05-07T20:32:31.8850350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:31.8850938Z 2025-05-07T20:32:31.8851051Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.8851472Z self=, 2025-05-07T20:32:31.8851883Z T=1, 2025-05-07T20:32:31.8852075Z D=5120, 2025-05-07T20:32:31.8852270Z scale_ub=1200.0, 2025-05-07T20:32:31.8852503Z contiguous=True, 2025-05-07T20:32:31.8852737Z compiled=True, 2025-05-07T20:32:31.8852943Z ) 2025-05-07T20:32:32.0296713Z self = 2025-05-07T20:32:32.0297563Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:32.0298159Z 2025-05-07T20:32:32.0298261Z @given( 2025-05-07T20:32:32.0298515Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.0298843Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.0299155Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.0299497Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.0299842Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.0300128Z ) 2025-05-07T20:32:32.0300490Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.0300939Z def test_silu_mul_quant( 2025-05-07T20:32:32.0301179Z self, 2025-05-07T20:32:32.0301382Z T: int, 2025-05-07T20:32:32.0301584Z D: int, 2025-05-07T20:32:32.0301806Z scale_ub: Optional[float], 2025-05-07T20:32:32.0302091Z contiguous: bool, 2025-05-07T20:32:32.0302339Z compiled: bool, 2025-05-07T20:32:32.0302574Z ) -> None: 2025-05-07T20:32:32.0302797Z torch.manual_seed(2025) 2025-05-07T20:32:32.0303050Z 2025-05-07T20:32:32.0303333Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.0303680Z 2025-05-07T20:32:32.0303881Z x_sign = torch.sign(x) 2025-05-07T20:32:32.0304180Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.0304492Z x = x_sign * x_clamp 2025-05-07T20:32:32.0304739Z x0 = x[:, :D] 2025-05-07T20:32:32.0304962Z x1 = x[:, D:] 2025-05-07T20:32:32.0305169Z 2025-05-07T20:32:32.0305362Z if contiguous: 2025-05-07T20:32:32.0305600Z x0 = x0.contiguous() 2025-05-07T20:32:32.0305862Z x1 = x1.contiguous() 2025-05-07T20:32:32.0306109Z 2025-05-07T20:32:32.0306313Z if scale_ub is not None: 2025-05-07T20:32:32.0306672Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.0307022Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.0307338Z ) 2025-05-07T20:32:32.0307530Z else: 2025-05-07T20:32:32.0307745Z scale_ub_tensor = None 2025-05-07T20:32:32.0308009Z 2025-05-07T20:32:32.0308252Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.0308570Z op = silu_mul_quant 2025-05-07T20:32:32.0308826Z if compiled: 2025-05-07T20:32:32.0309085Z op = torch.compile(op) 2025-05-07T20:32:32.0309386Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.0309668Z 2025-05-07T20:32:32.0309865Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.0310033Z 2025-05-07T20:32:32.0310137Z moe/activation_test.py:117: 2025-05-07T20:32:32.0310441Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.0310776Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.0311065Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.0311642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.0312217Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.0312964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.0313666Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.0314276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.0314975Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.0315654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.0316194Z kernel = self.compile( 2025-05-07T20:32:32.0316753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.0317473Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.0317915Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.0318151Z 2025-05-07T20:32:32.0318363Z self = 2025-05-07T20:32:32.0319471Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.0320881Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa89829b1c0>} 2025-05-07T20:32:32.0322256Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.0323300Z context = 2025-05-07T20:32:32.0323606Z 2025-05-07T20:32:32.0323781Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.0324320Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.0324801Z module_map=module_map) 2025-05-07T20:32:32.0325171Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.0325535Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.0325801Z E ^ 2025-05-07T20:32:32.0326275Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.0326739Z 2025-05-07T20:32:32.0327231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.0327785Z 2025-05-07T20:32:32.0327893Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.0328317Z self=, 2025-05-07T20:32:32.0328721Z T=1, 2025-05-07T20:32:32.0328916Z D=5120, 2025-05-07T20:32:32.0329114Z scale_ub=None, 2025-05-07T20:32:32.0329334Z contiguous=False, 2025-05-07T20:32:32.0329566Z compiled=True, 2025-05-07T20:32:32.0329778Z ) 2025-05-07T20:32:32.1003185Z self = 2025-05-07T20:32:32.1003975Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:32.1004372Z 2025-05-07T20:32:32.1004490Z @given( 2025-05-07T20:32:32.1004749Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.1005071Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.1005384Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.1005733Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.1006075Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.1006369Z ) 2025-05-07T20:32:32.1006732Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.1007297Z def test_silu_mul_quant( 2025-05-07T20:32:32.1007549Z self, 2025-05-07T20:32:32.1007751Z T: int, 2025-05-07T20:32:32.1007956Z D: int, 2025-05-07T20:32:32.1008178Z scale_ub: Optional[float], 2025-05-07T20:32:32.1008519Z contiguous: bool, 2025-05-07T20:32:32.1008765Z compiled: bool, 2025-05-07T20:32:32.1008989Z ) -> None: 2025-05-07T20:32:32.1009216Z torch.manual_seed(2025) 2025-05-07T20:32:32.1009468Z 2025-05-07T20:32:32.1009751Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.1010103Z 2025-05-07T20:32:32.1010308Z x_sign = torch.sign(x) 2025-05-07T20:32:32.1010612Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.1010927Z x = x_sign * x_clamp 2025-05-07T20:32:32.1011173Z x0 = x[:, :D] 2025-05-07T20:32:32.1011470Z x1 = x[:, D:] 2025-05-07T20:32:32.1011679Z 2025-05-07T20:32:32.1011872Z if contiguous: 2025-05-07T20:32:32.1012118Z x0 = x0.contiguous() 2025-05-07T20:32:32.1012380Z x1 = x1.contiguous() 2025-05-07T20:32:32.1012628Z 2025-05-07T20:32:32.1012827Z if scale_ub is not None: 2025-05-07T20:32:32.1013110Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.1013458Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.1013774Z ) 2025-05-07T20:32:32.1013968Z else: 2025-05-07T20:32:32.1014191Z scale_ub_tensor = None 2025-05-07T20:32:32.1014450Z 2025-05-07T20:32:32.1014687Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.1015011Z op = silu_mul_quant 2025-05-07T20:32:32.1015271Z if compiled: 2025-05-07T20:32:32.1015526Z op = torch.compile(op) 2025-05-07T20:32:32.1015827Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.1016113Z 2025-05-07T20:32:32.1016315Z y_fp8, y_scale = fn() 2025-05-07T20:32:32.1016605Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:32.1016901Z 2025-05-07T20:32:32.1017178Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.1017542Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:32.1017847Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:32.1018254Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:32.1018618Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.1018938Z 2025-05-07T20:32:32.1019147Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:32.1019348Z 2025-05-07T20:32:32.1019455Z moe/activation_test.py:126: 2025-05-07T20:32:32.1019824Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.1020169Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:32.1020512Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.1021323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:32.1022102Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:32.1022667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.1023370Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.1024076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:32.1024819Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:32.1025595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:32.1026366Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:32.1027163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:32.1027823Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:32.1028484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:32.1029015Z fn() 2025-05-07T20:32:32.1029542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:32.1030143Z self.fn.run( 2025-05-07T20:32:32.1030628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.1031172Z kernel = self.compile( 2025-05-07T20:32:32.1031733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.1032450Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.1032854Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.1033089Z 2025-05-07T20:32:32.1033303Z self = 2025-05-07T20:32:32.1034418Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.1035832Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fa8984d3d00>} 2025-05-07T20:32:32.1037269Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.1038322Z context = 2025-05-07T20:32:32.1038623Z 2025-05-07T20:32:32.1038794Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.1039332Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.1039816Z module_map=module_map) 2025-05-07T20:32:32.1040189Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.1040556Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:32.1040829Z E ^ 2025-05-07T20:32:32.1041301Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.1041764Z 2025-05-07T20:32:32.1042239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.1042769Z 2025-05-07T20:32:32.1042879Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.1043308Z self=, 2025-05-07T20:32:32.1043715Z T=1, 2025-05-07T20:32:32.1043900Z D=5120, 2025-05-07T20:32:32.1044098Z scale_ub=None, 2025-05-07T20:32:32.1044316Z contiguous=True, 2025-05-07T20:32:32.1044551Z compiled=False, 2025-05-07T20:32:32.1044763Z ) 2025-05-07T20:32:32.4260576Z self = 2025-05-07T20:32:32.4261368Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:32.4268399Z 2025-05-07T20:32:32.4268545Z @given( 2025-05-07T20:32:32.4268878Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.4269321Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.4269632Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.4269964Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.4270290Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.4270572Z ) 2025-05-07T20:32:32.4271037Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.4271482Z def test_silu_mul_quant( 2025-05-07T20:32:32.4271723Z self, 2025-05-07T20:32:32.4271978Z T: int, 2025-05-07T20:32:32.4272165Z D: int, 2025-05-07T20:32:32.4272380Z scale_ub: Optional[float], 2025-05-07T20:32:32.4272656Z contiguous: bool, 2025-05-07T20:32:32.4272898Z compiled: bool, 2025-05-07T20:32:32.4273127Z ) -> None: 2025-05-07T20:32:32.4273344Z torch.manual_seed(2025) 2025-05-07T20:32:32.4273582Z 2025-05-07T20:32:32.4273861Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.4274204Z 2025-05-07T20:32:32.4274396Z x_sign = torch.sign(x) 2025-05-07T20:32:32.4274683Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.4275070Z x = x_sign * x_clamp 2025-05-07T20:32:32.4275314Z x0 = x[:, :D] 2025-05-07T20:32:32.4275528Z x1 = x[:, D:] 2025-05-07T20:32:32.4275736Z 2025-05-07T20:32:32.4275925Z if contiguous: 2025-05-07T20:32:32.4276154Z x0 = x0.contiguous() 2025-05-07T20:32:32.4276418Z x1 = x1.contiguous() 2025-05-07T20:32:32.4276667Z 2025-05-07T20:32:32.4276861Z if scale_ub is not None: 2025-05-07T20:32:32.4277137Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.4277480Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.4277783Z ) 2025-05-07T20:32:32.4277984Z else: 2025-05-07T20:32:32.4278193Z scale_ub_tensor = None 2025-05-07T20:32:32.4278445Z 2025-05-07T20:32:32.4278690Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.4279004Z op = silu_mul_quant 2025-05-07T20:32:32.4279249Z if compiled: 2025-05-07T20:32:32.4279502Z 
op = torch.compile(op) 2025-05-07T20:32:32.4279802Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.4280084Z 2025-05-07T20:32:32.4280272Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.4280448Z 2025-05-07T20:32:32.4280548Z moe/activation_test.py:117: 2025-05-07T20:32:32.4280842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.4281177Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.4281459Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.4282190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.4282894Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.4283504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.4284192Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.4284874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.4285409Z kernel = self.compile( 2025-05-07T20:32:32.4285954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.4286619Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.4287020Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.4287269Z 2025-05-07T20:32:32.4287507Z self = 2025-05-07T20:32:32.4288606Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.4290052Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa8984d1ea0>} 2025-05-07T20:32:32.4291425Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.4292499Z context = 2025-05-07T20:32:32.4292795Z 2025-05-07T20:32:32.4292963Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.4293489Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.4293961Z module_map=module_map) 2025-05-07T20:32:32.4294325Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.4294681Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.4294938Z E ^ 2025-05-07T20:32:32.4295451Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.4295909Z 2025-05-07T20:32:32.4296330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.4296850Z 2025-05-07T20:32:32.4296957Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.4297426Z self=, 2025-05-07T20:32:32.4297821Z T=128, 2025-05-07T20:32:32.4298110Z D=5120, 2025-05-07T20:32:32.4298304Z scale_ub=None, 2025-05-07T20:32:32.4298518Z contiguous=False, 2025-05-07T20:32:32.4298748Z compiled=True, 2025-05-07T20:32:32.4298950Z ) 2025-05-07T20:32:32.4299271Z self = 2025-05-07T20:32:32.4299765Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:32.4300037Z 2025-05-07T20:32:32.4300115Z @given( 2025-05-07T20:32:32.4300347Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.4300657Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.4300961Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.4301292Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.4301625Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.4301911Z ) 2025-05-07T20:32:32.4302266Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.4302705Z def test_silu_mul_quant( 2025-05-07T20:32:32.4302943Z self, 2025-05-07T20:32:32.4303135Z T: int, 2025-05-07T20:32:32.4303323Z D: int, 2025-05-07T20:32:32.4303542Z scale_ub: Optional[float], 2025-05-07T20:32:32.4303864Z contiguous: bool, 2025-05-07T20:32:32.4304104Z compiled: bool, 2025-05-07T20:32:32.4304321Z ) -> None: 2025-05-07T20:32:32.4304537Z torch.manual_seed(2025) 2025-05-07T20:32:32.4304784Z 2025-05-07T20:32:32.4305059Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.4305395Z 2025-05-07T20:32:32.4305588Z x_sign = torch.sign(x) 2025-05-07T20:32:32.4305875Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.4306186Z x = x_sign * x_clamp 2025-05-07T20:32:32.4306425Z x0 = x[:, :D] 2025-05-07T20:32:32.4306637Z x1 = x[:, D:] 2025-05-07T20:32:32.4306846Z 2025-05-07T20:32:32.4307028Z if contiguous: 2025-05-07T20:32:32.4307249Z x0 = x0.contiguous() 2025-05-07T20:32:32.4307507Z x1 = x1.contiguous() 2025-05-07T20:32:32.4307747Z 2025-05-07T20:32:32.4307935Z if scale_ub is not None: 2025-05-07T20:32:32.4308212Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.4308543Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.4308849Z ) 2025-05-07T20:32:32.4309037Z else: 2025-05-07T20:32:32.4309246Z scale_ub_tensor = None 2025-05-07T20:32:32.4309493Z 2025-05-07T20:32:32.4309767Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.4310083Z op = silu_mul_quant 2025-05-07T20:32:32.4310328Z if compiled: 2025-05-07T20:32:32.4310573Z op = torch.compile(op) 2025-05-07T20:32:32.4310908Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.4311183Z 2025-05-07T20:32:32.4311371Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.4311542Z 2025-05-07T20:32:32.4311642Z moe/activation_test.py:117: 2025-05-07T20:32:32.4311941Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.4312259Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.4312546Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.4313110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.4313720Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.4314379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.4315078Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.4315621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.4316307Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.4316974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.4317561Z kernel = self.compile( 2025-05-07T20:32:32.4318114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.4318774Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.4319169Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.4319401Z 2025-05-07T20:32:32.4319615Z self = 2025-05-07T20:32:32.4320704Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.4322093Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa8984d0dc0>} 2025-05-07T20:32:32.4323497Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.4324536Z context = 2025-05-07T20:32:32.4324825Z 2025-05-07T20:32:32.4324998Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.4325525Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.4325996Z module_map=module_map) 2025-05-07T20:32:32.4326364Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.4326719Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.4326973Z E ^ 2025-05-07T20:32:32.4327498Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.4327950Z 2025-05-07T20:32:32.4328375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.4328894Z 2025-05-07T20:32:32.4329007Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.4329423Z self=, 2025-05-07T20:32:32.4329824Z T=128, 2025-05-07T20:32:32.4330016Z D=7168, 2025-05-07T20:32:32.4330252Z scale_ub=1200.0, 2025-05-07T20:32:32.4330475Z contiguous=False, 2025-05-07T20:32:32.4330698Z compiled=False, 2025-05-07T20:32:32.4330893Z ) 2025-05-07T20:32:32.5581634Z self = 2025-05-07T20:32:32.5582430Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:32.5582813Z 2025-05-07T20:32:32.5582939Z @given( 2025-05-07T20:32:32.5583227Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.5583543Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.5583862Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.5584209Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.5584541Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.5584832Z ) 2025-05-07T20:32:32.5585317Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.5585768Z def test_silu_mul_quant( 2025-05-07T20:32:32.5586018Z self, 2025-05-07T20:32:32.5586213Z T: int, 2025-05-07T20:32:32.5586407Z D: int, 2025-05-07T20:32:32.5586634Z scale_ub: Optional[float], 2025-05-07T20:32:32.5586913Z contiguous: bool, 2025-05-07T20:32:32.5587149Z compiled: bool, 2025-05-07T20:32:32.5587373Z ) -> None: 2025-05-07T20:32:32.5587589Z torch.manual_seed(2025) 2025-05-07T20:32:32.5587831Z 2025-05-07T20:32:32.5588105Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.5588453Z 2025-05-07T20:32:32.5588648Z x_sign = torch.sign(x) 2025-05-07T20:32:32.5588944Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.5589263Z x = x_sign * x_clamp 2025-05-07T20:32:32.5589508Z x0 = x[:, :D] 2025-05-07T20:32:32.5589734Z x1 = x[:, D:] 2025-05-07T20:32:32.5589936Z 2025-05-07T20:32:32.5590125Z if contiguous: 2025-05-07T20:32:32.5590363Z x0 = x0.contiguous() 2025-05-07T20:32:32.5590622Z x1 = x1.contiguous() 2025-05-07T20:32:32.5590862Z 2025-05-07T20:32:32.5591056Z if scale_ub is not None: 2025-05-07T20:32:32.5591330Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.5591673Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.5591991Z ) 2025-05-07T20:32:32.5592180Z else: 2025-05-07T20:32:32.5592392Z scale_ub_tensor = None 2025-05-07T20:32:32.5592647Z 2025-05-07T20:32:32.5592878Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.5593196Z op = silu_mul_quant 2025-05-07T20:32:32.5593523Z if compiled: 2025-05-07T20:32:32.5593772Z op = torch.compile(op) 2025-05-07T20:32:32.5594073Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.5594352Z 2025-05-07T20:32:32.5594547Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.5594713Z 2025-05-07T20:32:32.5594815Z moe/activation_test.py:117: 2025-05-07T20:32:32.5595112Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.5595442Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.5595728Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.5596434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.5597142Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.5597685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.5598385Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.5599061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.5599606Z kernel = self.compile( 2025-05-07T20:32:32.5600223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.5600896Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.5601294Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.5601561Z 2025-05-07T20:32:32.5601774Z self = 2025-05-07T20:32:32.5602872Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.5604280Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa89806e200>} 2025-05-07T20:32:32.5605701Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.5606752Z context = 2025-05-07T20:32:32.5607046Z 2025-05-07T20:32:32.5607220Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.5607743Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.5608216Z module_map=module_map) 2025-05-07T20:32:32.5608585Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.5608937Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.5609197Z E ^ 2025-05-07T20:32:32.5609669Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.5610128Z 2025-05-07T20:32:32.5610559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.5611077Z 2025-05-07T20:32:32.5611182Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.5611605Z self=, 2025-05-07T20:32:32.5612011Z T=128, 2025-05-07T20:32:32.5612228Z D=5120, 2025-05-07T20:32:32.5612419Z scale_ub=None, 2025-05-07T20:32:32.5612639Z contiguous=False, 2025-05-07T20:32:32.5612869Z compiled=False, 2025-05-07T20:32:32.5613072Z ) 2025-05-07T20:32:32.5613397Z self = 2025-05-07T20:32:32.5613895Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:32.5614228Z 2025-05-07T20:32:32.5614312Z @given( 2025-05-07T20:32:32.5614538Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.5614854Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.5615163Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.5615491Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.5615826Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.5616111Z ) 2025-05-07T20:32:32.5616464Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.5616910Z def test_silu_mul_quant( 2025-05-07T20:32:32.5617155Z self, 2025-05-07T20:32:32.5617345Z T: int, 2025-05-07T20:32:32.5617542Z D: int, 2025-05-07T20:32:32.5617757Z scale_ub: Optional[float], 2025-05-07T20:32:32.5618109Z contiguous: bool, 2025-05-07T20:32:32.5618368Z compiled: bool, 2025-05-07T20:32:32.5618604Z ) -> None: 2025-05-07T20:32:32.5618831Z torch.manual_seed(2025) 2025-05-07T20:32:32.5619084Z 2025-05-07T20:32:32.5619379Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.5619765Z 2025-05-07T20:32:32.5619963Z x_sign = torch.sign(x) 2025-05-07T20:32:32.5620332Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.5620676Z x = x_sign * x_clamp 2025-05-07T20:32:32.5620929Z x0 = x[:, :D] 2025-05-07T20:32:32.5621170Z x1 = x[:, D:] 2025-05-07T20:32:32.5621434Z 2025-05-07T20:32:32.5621625Z if contiguous: 2025-05-07T20:32:32.5621873Z x0 = x0.contiguous() 2025-05-07T20:32:32.5622154Z x1 = x1.contiguous() 2025-05-07T20:32:32.5622408Z 2025-05-07T20:32:32.5622608Z if scale_ub is not None: 2025-05-07T20:32:32.5622906Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.5623277Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.5623627Z ) 2025-05-07T20:32:32.5623830Z else: 2025-05-07T20:32:32.5624047Z scale_ub_tensor = None 2025-05-07T20:32:32.5624319Z 2025-05-07T20:32:32.5624601Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.5624920Z op = silu_mul_quant 2025-05-07T20:32:32.5625166Z if compiled: 2025-05-07T20:32:32.5625414Z op = torch.compile(op) 2025-05-07T20:32:32.5625715Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.5625989Z 2025-05-07T20:32:32.5626189Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.5626354Z 2025-05-07T20:32:32.5626460Z moe/activation_test.py:117: 2025-05-07T20:32:32.5626750Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.5627082Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.5627398Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.5628126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.5628822Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.5629371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.5630069Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.5630736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.5631278Z kernel = self.compile( 2025-05-07T20:32:32.5631831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.5632499Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.5632889Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.5633121Z 2025-05-07T20:32:32.5633380Z self = 2025-05-07T20:32:32.5634488Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.5635891Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa898626a70>} 2025-05-07T20:32:32.5637262Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.5638310Z context = 2025-05-07T20:32:32.5638608Z 2025-05-07T20:32:32.5638776Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.5639309Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.5639779Z module_map=module_map) 2025-05-07T20:32:32.5640151Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.5640547Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.5640808Z E ^ 2025-05-07T20:32:32.5641274Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.5641772Z 2025-05-07T20:32:32.5642191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.5642708Z 2025-05-07T20:32:32.5642817Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.5643233Z self=, 2025-05-07T20:32:32.5643635Z T=128, 2025-05-07T20:32:32.5643823Z D=5120, 2025-05-07T20:32:32.5644021Z scale_ub=1200.0, 2025-05-07T20:32:32.5644242Z contiguous=True, 2025-05-07T20:32:32.5644467Z compiled=False, 2025-05-07T20:32:32.5644674Z ) 2025-05-07T20:32:32.7570620Z self = 2025-05-07T20:32:32.7571220Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:32.7571495Z 2025-05-07T20:32:32.7571574Z @given( 2025-05-07T20:32:32.7571811Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7572135Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7572441Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7572774Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7573105Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7573393Z ) 2025-05-07T20:32:32.7573745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7574191Z def test_silu_mul_quant( 2025-05-07T20:32:32.7574440Z self, 2025-05-07T20:32:32.7574634Z T: int, 2025-05-07T20:32:32.7574831Z D: int, 2025-05-07T20:32:32.7575055Z scale_ub: Optional[float], 2025-05-07T20:32:32.7575329Z contiguous: bool, 2025-05-07T20:32:32.7575572Z compiled: bool, 2025-05-07T20:32:32.7575803Z ) -> None: 2025-05-07T20:32:32.7576018Z torch.manual_seed(2025) 2025-05-07T20:32:32.7576261Z 2025-05-07T20:32:32.7576542Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7576886Z 2025-05-07T20:32:32.7577082Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7577382Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7577688Z x = x_sign * x_clamp 2025-05-07T20:32:32.7577930Z x0 = x[:, :D] 2025-05-07T20:32:32.7578224Z x1 = x[:, D:] 2025-05-07T20:32:32.7578428Z 2025-05-07T20:32:32.7578616Z if contiguous: 2025-05-07T20:32:32.7578851Z x0 = x0.contiguous() 2025-05-07T20:32:32.7579220Z x1 = x1.contiguous() 2025-05-07T20:32:32.7579461Z 2025-05-07T20:32:32.7579657Z if scale_ub is not None: 2025-05-07T20:32:32.7579938Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7580277Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7580585Z ) 2025-05-07T20:32:32.7580779Z else: 2025-05-07T20:32:32.7580988Z scale_ub_tensor = None 2025-05-07T20:32:32.7581243Z 2025-05-07T20:32:32.7581479Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7581791Z op = silu_mul_quant 2025-05-07T20:32:32.7582045Z if compiled: 2025-05-07T20:32:32.7582294Z op = torch.compile(op) 2025-05-07T20:32:32.7582590Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7582861Z 2025-05-07T20:32:32.7583057Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.7583224Z 2025-05-07T20:32:32.7583332Z moe/activation_test.py:117: 2025-05-07T20:32:32.7583621Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7583946Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.7584238Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7584999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.7585703Z 
Hypothesis continues trying examples; each retry below prints the same test source and the same CompilationError [elided; identical to the first example above, except that compiled=True runs additionally pass through torch/_dynamo/eval_frame.py:678 before reaching fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 in silu_mul_quant].

Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
-> same CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
-> same CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
Here fn() returns, and the failure moves to the reference path:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
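Editor's note on the reference path: triton_quantize_fp8_row returns a rowwise-quantized FP8 tensor plus one float32 scale per row, and the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A rough pure-PyTorch emulation consistent with that contract is sketched below; FP8_E4M3_MAX and the exact scale_ub clamping are assumptions, not FBGEMM's implementation:

    # Rough emulation (assumptions noted) of rowwise FP8 quantization as the
    # test consumes it: y ~= y_fp8.to(torch.float32) * y_scale[:, None].
    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap the per-row max
        row_max = row_max.clamp(min=1e-12)  # guard all-zero rows
        y_scale = row_max / FP8_E4M3_MAX    # dequantization multiplier, shape [T]
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale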
2025-05-07T20:32:33.3349894Z op = torch.compile(op) 2025-05-07T20:32:33.3350214Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.3350494Z 2025-05-07T20:32:33.3350701Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.3350871Z 2025-05-07T20:32:33.3350989Z moe/activation_test.py:117: 2025-05-07T20:32:33.3351298Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.3351647Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.3351945Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.3352526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.3353113Z return fn(*args, **kwargs) 2025-05-07T20:32:33.3353800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.3354519Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.3355152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.3356163Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.3356851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.3357502Z kernel = self.compile( 2025-05-07T20:32:33.3358113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.3358792Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.3359203Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.3359438Z 2025-05-07T20:32:33.3359657Z self = 2025-05-07T20:32:33.3360775Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.3362277Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7adcabeb0>} 2025-05-07T20:32:33.3363664Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.3364720Z context = 2025-05-07T20:32:33.3365016Z 2025-05-07T20:32:33.3365190Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.3365728Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.3366232Z module_map=module_map) 2025-05-07T20:32:33.3374276Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.3374663Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.3374930Z E ^ 2025-05-07T20:32:33.3375419Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.3375885Z 2025-05-07T20:32:33.3376325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.3376851Z 2025-05-07T20:32:33.3376972Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.3377393Z self=, 2025-05-07T20:32:33.3377811Z T=1, 2025-05-07T20:32:33.3378099Z D=5120, 2025-05-07T20:32:33.3378412Z scale_ub=1200.0, 2025-05-07T20:32:33.3378655Z contiguous=False, 2025-05-07T20:32:33.3378895Z compiled=False, 2025-05-07T20:32:33.3379112Z ) 2025-05-07T20:32:33.3379449Z self = 2025-05-07T20:32:33.3379964Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:33.3380237Z 2025-05-07T20:32:33.3380318Z @given( 2025-05-07T20:32:33.3380570Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.3380904Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.3381228Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.3381568Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.3381919Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.3382220Z ) 2025-05-07T20:32:33.3382578Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.3383039Z def test_silu_mul_quant( 2025-05-07T20:32:33.3383295Z self, 2025-05-07T20:32:33.3383495Z T: int, 2025-05-07T20:32:33.3383705Z D: int, 2025-05-07T20:32:33.3383944Z scale_ub: Optional[float], 2025-05-07T20:32:33.3384222Z contiguous: bool, 2025-05-07T20:32:33.3384552Z compiled: bool, 2025-05-07T20:32:33.3384792Z ) -> None: 2025-05-07T20:32:33.3385017Z torch.manual_seed(2025) 2025-05-07T20:32:33.3385278Z 2025-05-07T20:32:33.3385569Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.3385974Z 2025-05-07T20:32:33.3386176Z x_sign = torch.sign(x) 2025-05-07T20:32:33.3386487Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.3386812Z x = x_sign * x_clamp 2025-05-07T20:32:33.3387057Z x0 = x[:, :D] 2025-05-07T20:32:33.3387291Z x1 = x[:, D:] 2025-05-07T20:32:33.3387516Z 2025-05-07T20:32:33.3387708Z if contiguous: 2025-05-07T20:32:33.3387963Z x0 = x0.contiguous() 2025-05-07T20:32:33.3388241Z x1 = x1.contiguous() 2025-05-07T20:32:33.3388489Z 2025-05-07T20:32:33.3388699Z if scale_ub is not None: 2025-05-07T20:32:33.3389038Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.3389387Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.3389712Z ) 2025-05-07T20:32:33.3389920Z else: 2025-05-07T20:32:33.3390140Z scale_ub_tensor = None 2025-05-07T20:32:33.3390412Z 2025-05-07T20:32:33.3390663Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.3390995Z op = silu_mul_quant 2025-05-07T20:32:33.3391251Z if compiled: 2025-05-07T20:32:33.3391519Z op = torch.compile(op) 2025-05-07T20:32:33.3391831Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.3392108Z 2025-05-07T20:32:33.3392317Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.3392487Z 2025-05-07T20:32:33.3392594Z moe/activation_test.py:117: 2025-05-07T20:32:33.3392901Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.3393247Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.3393535Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.3394250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.3394960Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.3395520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.3396216Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.3396902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.3397451Z kernel = self.compile( 2025-05-07T20:32:33.3398105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.3398784Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.3399200Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.3399433Z 2025-05-07T20:32:33.3399655Z self = 2025-05-07T20:32:33.3400761Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.3402180Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa898624940>} 2025-05-07T20:32:33.3403559Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.3404610Z context = 2025-05-07T20:32:33.3404905Z 2025-05-07T20:32:33.3405132Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.3405663Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.3406150Z module_map=module_map) 2025-05-07T20:32:33.3406570Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.3406929Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.3407199Z E ^ 2025-05-07T20:32:33.3407680Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.3408138Z 2025-05-07T20:32:33.3408574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.3409095Z 2025-05-07T20:32:33.3409202Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.3409677Z self=, 2025-05-07T20:32:33.3410094Z T=16384, 2025-05-07T20:32:33.3410291Z D=5120, 2025-05-07T20:32:33.3410496Z scale_ub=1200.0, 2025-05-07T20:32:33.3410731Z contiguous=False, 2025-05-07T20:32:33.3410959Z compiled=True, 2025-05-07T20:32:33.3411171Z ) 2025-05-07T20:32:33.4403419Z self = 2025-05-07T20:32:33.4403977Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:33.4404268Z 2025-05-07T20:32:33.4404352Z @given( 2025-05-07T20:32:33.4404600Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.4404925Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.4405259Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.4405598Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.4405942Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.4406248Z ) 2025-05-07T20:32:33.4406612Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.4407068Z def test_silu_mul_quant( 2025-05-07T20:32:33.4407326Z self, 2025-05-07T20:32:33.4407555Z T: int, 2025-05-07T20:32:33.4407786Z D: int, 2025-05-07T20:32:33.4408045Z scale_ub: Optional[float], 2025-05-07T20:32:33.4408332Z contiguous: bool, 2025-05-07T20:32:33.4408586Z compiled: bool, 2025-05-07T20:32:33.4408830Z ) -> None: 2025-05-07T20:32:33.4409052Z torch.manual_seed(2025) 2025-05-07T20:32:33.4409307Z 2025-05-07T20:32:33.4409624Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.4409985Z 2025-05-07T20:32:33.4410196Z x_sign = torch.sign(x) 2025-05-07T20:32:33.4410756Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.4411086Z x = x_sign * x_clamp 2025-05-07T20:32:33.4411348Z x0 = x[:, :D] 2025-05-07T20:32:33.4411575Z x1 = x[:, D:] 2025-05-07T20:32:33.4411795Z 2025-05-07T20:32:33.4412001Z if contiguous: 2025-05-07T20:32:33.4412239Z x0 = x0.contiguous() 2025-05-07T20:32:33.4412514Z x1 = x1.contiguous() 2025-05-07T20:32:33.4412769Z 2025-05-07T20:32:33.4412975Z if scale_ub is not None: 2025-05-07T20:32:33.4413258Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.4413610Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.4413931Z ) 2025-05-07T20:32:33.4414129Z else: 2025-05-07T20:32:33.4414352Z scale_ub_tensor = None 2025-05-07T20:32:33.4414619Z 2025-05-07T20:32:33.4414865Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.4415198Z op = silu_mul_quant 2025-05-07T20:32:33.4415460Z if compiled: 2025-05-07T20:32:33.4415714Z op = torch.compile(op) 2025-05-07T20:32:33.4416030Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.4416316Z 2025-05-07T20:32:33.4416594Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.4416772Z 2025-05-07T20:32:33.4416878Z moe/activation_test.py:117: 2025-05-07T20:32:33.4417184Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.4417626Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.4417939Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.4418595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.4419171Z return fn(*args, **kwargs) 
2025-05-07T20:32:33.4419840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.4420547Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.4421102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.4421882Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.4422561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.4423112Z kernel = self.compile( 2025-05-07T20:32:33.4423672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.4424344Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.4424752Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.4424991Z 2025-05-07T20:32:33.4425205Z self = 2025-05-07T20:32:33.4426316Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.4427741Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ada448b0>} 2025-05-07T20:32:33.4429110Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.4430167Z context = 2025-05-07T20:32:33.4430468Z 2025-05-07T20:32:33.4430640Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.4431223Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.4431699Z module_map=module_map) 2025-05-07T20:32:33.4432081Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.4432456Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.4432719Z E ^ 2025-05-07T20:32:33.4433204Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.4433670Z 2025-05-07T20:32:33.4434094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.4434619Z 2025-05-07T20:32:33.4434735Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.4435158Z self=, 2025-05-07T20:32:33.4435572Z T=2048, 2025-05-07T20:32:33.4435773Z D=7168, 2025-05-07T20:32:33.4435970Z scale_ub=1200.0, 2025-05-07T20:32:33.4436208Z contiguous=False, 2025-05-07T20:32:33.4436448Z compiled=True, 2025-05-07T20:32:33.4436658Z ) 2025-05-07T20:32:33.4436990Z self = 2025-05-07T20:32:33.4437508Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:33.4437785Z 2025-05-07T20:32:33.4437922Z @given( 2025-05-07T20:32:33.4438156Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.4438482Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.4438802Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.4439213Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.4439554Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.4439851Z ) 2025-05-07T20:32:33.4440209Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.4440668Z def test_silu_mul_quant( 2025-05-07T20:32:33.4440932Z self, 2025-05-07T20:32:33.4441140Z T: int, 2025-05-07T20:32:33.4441346Z D: int, 2025-05-07T20:32:33.4441582Z scale_ub: Optional[float], 2025-05-07T20:32:33.4441867Z contiguous: bool, 2025-05-07T20:32:33.4442162Z compiled: bool, 2025-05-07T20:32:33.4442398Z ) -> None: 2025-05-07T20:32:33.4442633Z torch.manual_seed(2025) 2025-05-07T20:32:33.4442881Z 2025-05-07T20:32:33.4443173Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.4443529Z 2025-05-07T20:32:33.4443737Z x_sign = torch.sign(x) 2025-05-07T20:32:33.4444040Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.4444363Z x = x_sign * x_clamp 2025-05-07T20:32:33.4444617Z x0 = x[:, :D] 2025-05-07T20:32:33.4444843Z x1 = x[:, D:] 2025-05-07T20:32:33.4445064Z 2025-05-07T20:32:33.4445265Z if contiguous: 2025-05-07T20:32:33.4445503Z x0 = x0.contiguous() 2025-05-07T20:32:33.4445774Z x1 = x1.contiguous() 2025-05-07T20:32:33.4446026Z 2025-05-07T20:32:33.4446227Z if scale_ub is not None: 2025-05-07T20:32:33.4446514Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.4446866Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.4447176Z ) 2025-05-07T20:32:33.4447390Z else: 2025-05-07T20:32:33.4447652Z scale_ub_tensor = None 2025-05-07T20:32:33.4447914Z 2025-05-07T20:32:33.4448161Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.4448493Z op = silu_mul_quant 2025-05-07T20:32:33.4448755Z if compiled: 2025-05-07T20:32:33.4449008Z op = torch.compile(op) 2025-05-07T20:32:33.4449322Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.4449610Z 2025-05-07T20:32:33.4449807Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.4449987Z 2025-05-07T20:32:33.4450092Z moe/activation_test.py:117: 2025-05-07T20:32:33.4450450Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.4450783Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.4451079Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.4451661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.4452240Z return fn(*args, **kwargs) 
2025-05-07T20:32:33.4452909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.4453620Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.4454172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.4454864Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.4455797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.4456355Z kernel = self.compile( 2025-05-07T20:32:33.4456914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.4457587Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.4458160Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.4458420Z 2025-05-07T20:32:33.4458645Z self = 2025-05-07T20:32:33.4459810Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.4461208Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ada45090>} 2025-05-07T20:32:33.4462584Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.4463696Z context = 2025-05-07T20:32:33.4463994Z 2025-05-07T20:32:33.4464172Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.4464701Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.4465188Z module_map=module_map) 2025-05-07T20:32:33.4465568Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.4465937Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.4466199Z E ^ 2025-05-07T20:32:33.4466680Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.4467139Z 2025-05-07T20:32:33.4467576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.4468093Z 2025-05-07T20:32:33.5756320Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.5756863Z self=, 2025-05-07T20:32:33.5757458Z T=1, 2025-05-07T20:32:33.5757679Z D=5120, 2025-05-07T20:32:33.5757881Z scale_ub=None, 2025-05-07T20:32:33.5758113Z contiguous=False, 2025-05-07T20:32:33.5758351Z compiled=False, 2025-05-07T20:32:33.5758572Z ) 2025-05-07T20:32:33.5758907Z self = 2025-05-07T20:32:33.5759404Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:33.5759680Z 2025-05-07T20:32:33.5759763Z @given( 2025-05-07T20:32:33.5760009Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.5760327Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.5760932Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.5761281Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.5761622Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.5761920Z ) 2025-05-07T20:32:33.5762290Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.5762750Z def test_silu_mul_quant( 2025-05-07T20:32:33.5762998Z self, 2025-05-07T20:32:33.5763205Z T: int, 2025-05-07T20:32:33.5763417Z D: int, 2025-05-07T20:32:33.5763643Z scale_ub: Optional[float], 2025-05-07T20:32:33.5763931Z contiguous: bool, 2025-05-07T20:32:33.5764185Z compiled: bool, 2025-05-07T20:32:33.5764419Z ) -> None: 2025-05-07T20:32:33.5764647Z torch.manual_seed(2025) 2025-05-07T20:32:33.5764902Z 2025-05-07T20:32:33.5765184Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.5765543Z 2025-05-07T20:32:33.5765756Z x_sign = torch.sign(x) 2025-05-07T20:32:33.5766053Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.5766378Z x = x_sign * x_clamp 2025-05-07T20:32:33.5766632Z x0 = x[:, :D] 2025-05-07T20:32:33.5766931Z x1 = x[:, D:] 2025-05-07T20:32:33.5767152Z 2025-05-07T20:32:33.5767352Z if contiguous: 2025-05-07T20:32:33.5767596Z x0 = x0.contiguous() 2025-05-07T20:32:33.5767903Z x1 = x1.contiguous() 2025-05-07T20:32:33.5768243Z 2025-05-07T20:32:33.5768448Z if scale_ub is not None: 2025-05-07T20:32:33.5768728Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.5769077Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.5769398Z ) 2025-05-07T20:32:33.5769595Z else: 2025-05-07T20:32:33.5769818Z scale_ub_tensor = None 2025-05-07T20:32:33.5770085Z 2025-05-07T20:32:33.5770328Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.5770656Z op = silu_mul_quant 2025-05-07T20:32:33.5770916Z if compiled: 2025-05-07T20:32:33.5771252Z op = torch.compile(op) 2025-05-07T20:32:33.5771564Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.5771856Z 2025-05-07T20:32:33.5772057Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.5772234Z 2025-05-07T20:32:33.5772339Z moe/activation_test.py:117: 2025-05-07T20:32:33.5772643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.5772993Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.5773278Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.5773993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.5774706Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.5775255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.5775952Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.5776635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.5777184Z kernel = self.compile( 2025-05-07T20:32:33.5777732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.5778550Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.5778951Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.5779179Z 2025-05-07T20:32:33.5779395Z self = 2025-05-07T20:32:33.5780541Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.5781960Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ada457e0>} 2025-05-07T20:32:33.5783334Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.5784380Z context = 2025-05-07T20:32:33.5784670Z 2025-05-07T20:32:33.5784843Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.5785378Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.5785857Z module_map=module_map) 2025-05-07T20:32:33.5786237Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.5786591Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.5786856Z E ^ 2025-05-07T20:32:33.5787343Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.5787844Z 2025-05-07T20:32:33.5788268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.5788793Z 2025-05-07T20:32:33.5788943Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.5789372Z self=, 2025-05-07T20:32:33.5789781Z T=4096, 2025-05-07T20:32:33.5789975Z D=7168, 2025-05-07T20:32:33.5790181Z scale_ub=1200.0, 2025-05-07T20:32:33.5790416Z contiguous=False, 2025-05-07T20:32:33.5790645Z compiled=False, 2025-05-07T20:32:33.5790863Z ) 2025-05-07T20:32:33.5791197Z self = 2025-05-07T20:32:33.5791697Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:33.5791984Z 2025-05-07T20:32:33.5792114Z @given( 2025-05-07T20:32:33.5792358Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.5792681Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.5792990Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.5793330Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.5793671Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.5793955Z ) 2025-05-07T20:32:33.5794314Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.5794760Z def test_silu_mul_quant( 2025-05-07T20:32:33.5795001Z self, 2025-05-07T20:32:33.5795201Z T: int, 2025-05-07T20:32:33.5795400Z D: int, 2025-05-07T20:32:33.5795617Z scale_ub: Optional[float], 2025-05-07T20:32:33.5795896Z contiguous: bool, 2025-05-07T20:32:33.5796144Z compiled: bool, 2025-05-07T20:32:33.5796366Z ) -> None: 2025-05-07T20:32:33.5796589Z torch.manual_seed(2025) 2025-05-07T20:32:33.5796843Z 2025-05-07T20:32:33.5797124Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.5797467Z 2025-05-07T20:32:33.5797668Z x_sign = torch.sign(x) 2025-05-07T20:32:33.5798016Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.5798328Z x = x_sign * x_clamp 2025-05-07T20:32:33.5798575Z x0 = x[:, :D] 2025-05-07T20:32:33.5798794Z x1 = x[:, D:] 2025-05-07T20:32:33.5799000Z 2025-05-07T20:32:33.5799194Z if contiguous: 2025-05-07T20:32:33.5799430Z x0 = x0.contiguous() 2025-05-07T20:32:33.5799690Z x1 = x1.contiguous() 2025-05-07T20:32:33.5799935Z 2025-05-07T20:32:33.5800133Z if scale_ub is not None: 2025-05-07T20:32:33.5800462Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.5800806Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.5801116Z ) 2025-05-07T20:32:33.5801330Z else: 2025-05-07T20:32:33.5809471Z scale_ub_tensor = None 2025-05-07T20:32:33.5809736Z 2025-05-07T20:32:33.5809987Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.5810305Z op = silu_mul_quant 2025-05-07T20:32:33.5810561Z if compiled: 2025-05-07T20:32:33.5810817Z op = torch.compile(op) 2025-05-07T20:32:33.5811116Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.5811392Z 2025-05-07T20:32:33.5811596Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.5811763Z 2025-05-07T20:32:33.5811864Z moe/activation_test.py:117: 2025-05-07T20:32:33.5812167Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.5812503Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.5812794Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.5813491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:33.5814199Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:33.5814826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:33.5815519Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:33.5816238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:33.5816783Z     kernel = self.compile(
2025-05-07T20:32:33.5817337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:33.5818077Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:33.5818484Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:33.5818928Z self = <...>
2025-05-07T20:32:33.5820081Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:33.5821478Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7fa7ada46200>}
2025-05-07T20:32:33.5822849Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:33.5823896Z context = <...>
2025-05-07T20:32:33.5824371Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:33.5824897Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:33.5825381Z                            module_map=module_map)
2025-05-07T20:32:33.5825758Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:33.5826121Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:33.5826378Z E   ^
2025-05-07T20:32:33.5826853Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:33.5827792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:33.5828431Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:33.5828851Z     self=<...>,
2025-05-07T20:32:33.5829309Z     T=16384,
2025-05-07T20:32:33.5829513Z     D=7168,
2025-05-07T20:32:33.5829705Z     scale_ub=None,
2025-05-07T20:32:33.5829932Z     contiguous=True,
2025-05-07T20:32:33.5830180Z     compiled=True,
2025-05-07T20:32:33.5830384Z )
2025-05-07T20:32:33.7769210Z self = <...>
2025-05-07T20:32:33.7769875Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:33.7770418Z     @given(
2025-05-07T20:32:33.7770726Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:33.7771148Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:33.7771482Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:33.7771834Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:33.7772179Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:33.7772472Z     )
2025-05-07T20:32:33.7772847Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:33.7773306Z     def test_silu_mul_quant(
2025-05-07T20:32:33.7773565Z         self,
2025-05-07T20:32:33.7773804Z         T: int,
2025-05-07T20:32:33.7774023Z         D: int,
2025-05-07T20:32:33.7774433Z         scale_ub: Optional[float],
2025-05-07T20:32:33.7774722Z         contiguous: bool,
2025-05-07T20:32:33.7774971Z         compiled: bool,
2025-05-07T20:32:33.7775208Z     ) -> None:
2025-05-07T20:32:33.7775436Z         torch.manual_seed(2025)
2025-05-07T20:32:33.7776052Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:33.7776604Z         x_sign = torch.sign(x)
2025-05-07T20:32:33.7776910Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:33.7777230Z         x = x_sign * x_clamp
2025-05-07T20:32:33.7777474Z         x0 = x[:, :D]
2025-05-07T20:32:33.7777703Z         x1 = x[:, D:]
2025-05-07T20:32:33.7778258Z         if contiguous:
2025-05-07T20:32:33.7778503Z             x0 = x0.contiguous()
2025-05-07T20:32:33.7778777Z             x1 = x1.contiguous()
2025-05-07T20:32:33.7779320Z         if scale_ub is not None:
2025-05-07T20:32:33.7779617Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:33.7779968Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:33.7780287Z             )
2025-05-07T20:32:33.7780494Z         else:
2025-05-07T20:32:33.7780723Z             scale_ub_tensor = None
2025-05-07T20:32:33.7781237Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:33.7781564Z             op = silu_mul_quant
2025-05-07T20:32:33.7781821Z             if compiled:
2025-05-07T20:32:33.7782081Z                 op = torch.compile(op)
2025-05-07T20:32:33.7782395Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:33.7782883Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:33.7783165Z moe/activation_test.py:117:
2025-05-07T20:32:33.7783464Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:33.7783807Z moe/activation_test.py:115: in fn
2025-05-07T20:32:33.7784105Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:33.7784684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:33.7785260Z     return fn(*args, **kwargs)
2025-05-07T20:32:33.7785943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:33.7786653Z     _fbgemm_silu_mul_quant[grid](
    ... (Triton frames identical to the trace above) ...
2025-05-07T20:32:33.7798271Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:33.7798631Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:33.7798904Z E   ^
2025-05-07T20:32:33.7799385Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:33.7800318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
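Analysis: Triton raises this ValueError while lowering _fbgemm_silu_mul_quant to TTIR because fp8e4nv (the FP8 E4M3 format that torch.float8_e4m3fn maps to) is only implemented for NVIDIA GPUs with compute capability 8.9 or newer; the A10G in this linux.g5.4xlarge runner is SM 8.6, where only fp8e4b15 and fp8e5 are available. A minimal sketch of a capability guard such a suite could use follows; the helper name supports_fp8e4nv and the skip wiring are illustrative assumptions, not part of moe/activation_test.py.

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        """True if Triton's fp8e4nv (FP8 E4M3) dtype can compile on this GPU."""
        if not torch.cuda.is_available():
            return False
        # fp8e4nv needs NVIDIA compute capability 8.9+ (Ada/Hopper); the A10G
        # on a g5.4xlarge is SM 8.6 and only offers fp8e4b15 / fp8e5.
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipIf(not supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
    class ActivationTests(unittest.TestCase):
        ...  # test_silu_mul_quant would live here, unchanged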
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.7799890Z 2025-05-07T20:32:33.7800318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.7800845Z 2025-05-07T20:32:33.7800956Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.7801388Z self=, 2025-05-07T20:32:33.7801811Z T=4096, 2025-05-07T20:32:33.7802010Z D=5120, 2025-05-07T20:32:33.7802217Z scale_ub=None, 2025-05-07T20:32:33.7802472Z contiguous=False, 2025-05-07T20:32:33.7802712Z compiled=True, 2025-05-07T20:32:33.7802930Z ) 2025-05-07T20:32:33.7803265Z self = 2025-05-07T20:32:33.7803775Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:33.7804055Z 2025-05-07T20:32:33.7804138Z @given( 2025-05-07T20:32:33.7804384Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.7804716Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.7805040Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.7805384Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.7805728Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.7806032Z ) 2025-05-07T20:32:33.7806395Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.7806851Z def test_silu_mul_quant( 2025-05-07T20:32:33.7807102Z self, 2025-05-07T20:32:33.7807300Z T: int, 2025-05-07T20:32:33.7807506Z D: int, 2025-05-07T20:32:33.7807734Z scale_ub: Optional[float], 2025-05-07T20:32:33.7808011Z contiguous: bool, 2025-05-07T20:32:33.7808263Z compiled: bool, 2025-05-07T20:32:33.7808495Z ) -> None: 2025-05-07T20:32:33.7808771Z torch.manual_seed(2025) 2025-05-07T20:32:33.7809028Z 2025-05-07T20:32:33.7809315Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.7809667Z 2025-05-07T20:32:33.7809866Z x_sign = torch.sign(x) 2025-05-07T20:32:33.7810171Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.7810490Z x = x_sign * x_clamp 2025-05-07T20:32:33.7810734Z x0 = x[:, :D] 2025-05-07T20:32:33.7810964Z x1 = x[:, D:] 2025-05-07T20:32:33.7811188Z 2025-05-07T20:32:33.7811382Z if contiguous: 2025-05-07T20:32:33.7811626Z x0 = x0.contiguous() 2025-05-07T20:32:33.7811894Z x1 = x1.contiguous() 2025-05-07T20:32:33.7812139Z 2025-05-07T20:32:33.7812345Z if scale_ub is not None: 2025-05-07T20:32:33.7812633Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.7812976Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.7813304Z ) 2025-05-07T20:32:33.7813512Z else: 2025-05-07T20:32:33.7813729Z scale_ub_tensor = None 2025-05-07T20:32:33.7813991Z 2025-05-07T20:32:33.7814243Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.7814564Z op = silu_mul_quant 2025-05-07T20:32:33.7814878Z if compiled: 2025-05-07T20:32:33.7815141Z op = torch.compile(op) 2025-05-07T20:32:33.7815454Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.7815732Z 2025-05-07T20:32:33.7815981Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.7816152Z 2025-05-07T20:32:33.7816263Z moe/activation_test.py:117: 2025-05-07T20:32:33.7816565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.7816922Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.7817212Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.7817794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.7818481Z return fn(*args, **kwargs) 
2025-05-07T20:32:33.7819158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.7819953Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.7820509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.7821209Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.7821889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.7822441Z kernel = self.compile( 2025-05-07T20:32:33.7823004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.7823685Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.7824092Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.7824329Z 2025-05-07T20:32:33.7824549Z self = 2025-05-07T20:32:33.7825663Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.7827075Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad768280>} 2025-05-07T20:32:33.7828451Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.7829553Z context = 2025-05-07T20:32:33.7829857Z 2025-05-07T20:32:33.7830031Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.7830573Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.7831054Z module_map=module_map) 2025-05-07T20:32:33.7831433Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.7831804Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.7832072Z E ^ 2025-05-07T20:32:33.7832551Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.7833018Z 2025-05-07T20:32:33.7833444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.7833966Z 2025-05-07T20:32:34.1088923Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.1089470Z self=, 2025-05-07T20:32:34.1089891Z T=4096, 2025-05-07T20:32:34.1090095Z D=5120, 2025-05-07T20:32:34.1090311Z scale_ub=1200.0, 2025-05-07T20:32:34.1090557Z contiguous=False, 2025-05-07T20:32:34.1090795Z compiled=False, 2025-05-07T20:32:34.1091304Z ) 2025-05-07T20:32:34.1091647Z self = 2025-05-07T20:32:34.1092166Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:34.1092528Z 2025-05-07T20:32:34.1092613Z @given( 2025-05-07T20:32:34.1092858Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.1093184Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.1093501Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.1093850Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.1094200Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.1094499Z ) 2025-05-07T20:32:34.1094862Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.1095314Z def test_silu_mul_quant( 2025-05-07T20:32:34.1095658Z self, 2025-05-07T20:32:34.1095858Z T: int, 2025-05-07T20:32:34.1096070Z D: int, 2025-05-07T20:32:34.1096299Z scale_ub: Optional[float], 2025-05-07T20:32:34.1096577Z contiguous: bool, 2025-05-07T20:32:34.1096829Z compiled: bool, 2025-05-07T20:32:34.1097069Z ) -> None: 2025-05-07T20:32:34.1097296Z torch.manual_seed(2025) 2025-05-07T20:32:34.1097574Z 2025-05-07T20:32:34.1097887Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.1098301Z 2025-05-07T20:32:34.1098506Z x_sign = torch.sign(x) 2025-05-07T20:32:34.1098809Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.1099122Z x = x_sign * x_clamp 2025-05-07T20:32:34.1099373Z x0 = x[:, :D] 2025-05-07T20:32:34.1099605Z x1 = x[:, D:] 2025-05-07T20:32:34.1099820Z 2025-05-07T20:32:34.1100028Z if contiguous: 2025-05-07T20:32:34.1100284Z x0 = x0.contiguous() 2025-05-07T20:32:34.1100550Z x1 = x1.contiguous() 2025-05-07T20:32:34.1100801Z 2025-05-07T20:32:34.1101005Z if scale_ub is not None: 2025-05-07T20:32:34.1101283Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.1101628Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.1101947Z ) 2025-05-07T20:32:34.1102147Z else: 2025-05-07T20:32:34.1102364Z scale_ub_tensor = None 2025-05-07T20:32:34.1102626Z 2025-05-07T20:32:34.1102870Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.1103188Z op = silu_mul_quant 2025-05-07T20:32:34.1103449Z if compiled: 2025-05-07T20:32:34.1103705Z op = torch.compile(op) 2025-05-07T20:32:34.1104099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.1104388Z 2025-05-07T20:32:34.1104594Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.1104761Z 2025-05-07T20:32:34.1104868Z moe/activation_test.py:117: 2025-05-07T20:32:34.1105172Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.1105510Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.1105801Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.1106502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:34.1107212Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.1107817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.1108508Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.1109189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.1109735Z kernel = self.compile( 2025-05-07T20:32:34.1110291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.1111028Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.1111434Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.1111663Z 2025-05-07T20:32:34.1111885Z self = 2025-05-07T20:32:34.1113035Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.1114456Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad769000>} 2025-05-07T20:32:34.1115834Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.1116935Z context = 2025-05-07T20:32:34.1117232Z 2025-05-07T20:32:34.1117413Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.1117950Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.1118489Z module_map=module_map) 2025-05-07T20:32:34.1118873Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.1119239Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.1119505Z E ^ 2025-05-07T20:32:34.1119985Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.1120442Z 2025-05-07T20:32:34.1120875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.1121402Z 2025-05-07T20:32:34.1121528Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.1121954Z self=, 2025-05-07T20:32:34.1122373Z T=4096, 2025-05-07T20:32:34.1122577Z D=5120, 2025-05-07T20:32:34.1122784Z scale_ub=1200.0, 2025-05-07T20:32:34.1123024Z contiguous=False, 2025-05-07T20:32:34.1123264Z compiled=True, 2025-05-07T20:32:34.1123474Z ) 2025-05-07T20:32:34.1123806Z self = 2025-05-07T20:32:34.1124312Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:34.1124591Z 2025-05-07T20:32:34.1124671Z @given( 2025-05-07T20:32:34.1124960Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.1125286Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.1125602Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.1125944Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.1126284Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.1126585Z ) 2025-05-07T20:32:34.1126945Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.1127397Z def test_silu_mul_quant( 2025-05-07T20:32:34.1127655Z self, 2025-05-07T20:32:34.1127855Z T: int, 2025-05-07T20:32:34.1128062Z D: int, 2025-05-07T20:32:34.1128293Z scale_ub: Optional[float], 2025-05-07T20:32:34.1128568Z contiguous: bool, 2025-05-07T20:32:34.1128819Z compiled: bool, 2025-05-07T20:32:34.1129051Z ) -> None: 2025-05-07T20:32:34.1129272Z torch.manual_seed(2025) 2025-05-07T20:32:34.1129526Z 2025-05-07T20:32:34.1129818Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.1130165Z 2025-05-07T20:32:34.1130371Z x_sign = torch.sign(x) 2025-05-07T20:32:34.1130676Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.1130995Z x = x_sign * x_clamp 2025-05-07T20:32:34.1131290Z x0 = x[:, :D] 2025-05-07T20:32:34.1131521Z x1 = x[:, D:] 2025-05-07T20:32:34.1131741Z 2025-05-07T20:32:34.1131932Z if contiguous: 2025-05-07T20:32:34.1132175Z x0 = x0.contiguous() 2025-05-07T20:32:34.1132485Z x1 = x1.contiguous() 2025-05-07T20:32:34.1132731Z 2025-05-07T20:32:34.1132938Z if scale_ub is not None: 2025-05-07T20:32:34.1133229Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.1133572Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.1133892Z ) 2025-05-07T20:32:34.1134097Z else: 2025-05-07T20:32:34.1134314Z scale_ub_tensor = None 2025-05-07T20:32:34.1134584Z 2025-05-07T20:32:34.1134830Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.1135149Z op = silu_mul_quant 2025-05-07T20:32:34.1135459Z if compiled: 2025-05-07T20:32:34.1135719Z op = torch.compile(op) 2025-05-07T20:32:34.1136037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.1136317Z 2025-05-07T20:32:34.1136520Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.1136688Z 2025-05-07T20:32:34.1136796Z moe/activation_test.py:117: 2025-05-07T20:32:34.1137094Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.1137432Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.1137724Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.1138401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:34.1138973Z return fn(*args, **kwargs) 
2025-05-07T20:32:34.1139652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.1140366Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.1140912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.1141617Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.1142293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.1142837Z kernel = self.compile( 2025-05-07T20:32:34.1143395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.1144069Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.1144477Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.1144757Z 2025-05-07T20:32:34.1144973Z self = 2025-05-07T20:32:34.1146076Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.1147479Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad768700>} 2025-05-07T20:32:34.1148905Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.1149953Z context = 2025-05-07T20:32:34.1150251Z 2025-05-07T20:32:34.1150430Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.1150972Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.1151459Z module_map=module_map) 2025-05-07T20:32:34.1151878Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.1152244Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.1152518Z E ^ 2025-05-07T20:32:34.1152994Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.1153542Z 2025-05-07T20:32:34.1162320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.1162899Z 2025-05-07T20:32:34.2430622Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.2431154Z self=, 2025-05-07T20:32:34.2431725Z T=2048, 2025-05-07T20:32:34.2431951Z D=7168, 2025-05-07T20:32:34.2432159Z scale_ub=1200.0, 2025-05-07T20:32:34.2432397Z contiguous=False, 2025-05-07T20:32:34.2432640Z compiled=False, 2025-05-07T20:32:34.2433181Z ) 2025-05-07T20:32:34.2433522Z self = 2025-05-07T20:32:34.2434053Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:34.2434354Z 2025-05-07T20:32:34.2434438Z @given( 2025-05-07T20:32:34.2434688Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.2435022Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.2435351Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.2435708Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.2436051Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.2436355Z ) 2025-05-07T20:32:34.2436728Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.2437191Z def test_silu_mul_quant( 2025-05-07T20:32:34.2437442Z self, 2025-05-07T20:32:34.2437653Z T: int, 2025-05-07T20:32:34.2437901Z D: int, 2025-05-07T20:32:34.2438140Z scale_ub: Optional[float], 2025-05-07T20:32:34.2438431Z contiguous: bool, 2025-05-07T20:32:34.2438685Z compiled: bool, 2025-05-07T20:32:34.2438924Z ) -> None: 2025-05-07T20:32:34.2439153Z torch.manual_seed(2025) 2025-05-07T20:32:34.2439412Z 2025-05-07T20:32:34.2439701Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.2440069Z 2025-05-07T20:32:34.2440276Z x_sign = torch.sign(x) 2025-05-07T20:32:34.2440579Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.2440904Z x = x_sign * x_clamp 2025-05-07T20:32:34.2441161Z x0 = x[:, :D] 2025-05-07T20:32:34.2441382Z x1 = x[:, D:] 2025-05-07T20:32:34.2441604Z 2025-05-07T20:32:34.2441901Z if contiguous: 2025-05-07T20:32:34.2442149Z x0 = x0.contiguous() 2025-05-07T20:32:34.2442431Z x1 = x1.contiguous() 2025-05-07T20:32:34.2442682Z 2025-05-07T20:32:34.2442883Z if scale_ub is not None: 2025-05-07T20:32:34.2443183Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.2443539Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.2443875Z ) 2025-05-07T20:32:34.2444073Z else: 2025-05-07T20:32:34.2444304Z scale_ub_tensor = None 2025-05-07T20:32:34.2444577Z 2025-05-07T20:32:34.2444819Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.2445154Z op = silu_mul_quant 2025-05-07T20:32:34.2445424Z if compiled: 2025-05-07T20:32:34.2445713Z op = torch.compile(op) 2025-05-07T20:32:34.2446053Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.2446358Z 2025-05-07T20:32:34.2446566Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.2446756Z 2025-05-07T20:32:34.2446871Z moe/activation_test.py:117: 2025-05-07T20:32:34.2447201Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.2447614Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.2448035Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.2448756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:34.2449484Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.2450126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.2450836Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.2451524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.2452079Z kernel = self.compile( 2025-05-07T20:32:34.2452647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.2453338Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.2453826Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.2454064Z 2025-05-07T20:32:34.2454282Z self = 2025-05-07T20:32:34.2455412Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.2457154Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad769240>} 2025-05-07T20:32:34.2458641Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.2459732Z context = 2025-05-07T20:32:34.2460039Z 2025-05-07T20:32:34.2460215Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.2460758Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.2461244Z module_map=module_map) 2025-05-07T20:32:34.2461626Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.2461998Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.2462266Z E ^ 2025-05-07T20:32:34.2462752Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.2463227Z 2025-05-07T20:32:34.2463772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.2464306Z 2025-05-07T20:32:34.2464425Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.2464856Z self=, 2025-05-07T20:32:34.2465280Z T=1, 2025-05-07T20:32:34.2465478Z D=7168, 2025-05-07T20:32:34.2465682Z scale_ub=None, 2025-05-07T20:32:34.2465902Z contiguous=True, 2025-05-07T20:32:34.2466140Z compiled=False, 2025-05-07T20:32:34.2466358Z ) 2025-05-07T20:32:34.2466685Z self = 2025-05-07T20:32:34.2467182Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.2467449Z 2025-05-07T20:32:34.2467536Z @given( 2025-05-07T20:32:34.2467772Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.2468100Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.2468451Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.2468811Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.2469154Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.2469460Z ) 2025-05-07T20:32:34.2469892Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.2470343Z def test_silu_mul_quant( 2025-05-07T20:32:34.2470594Z self, 2025-05-07T20:32:34.2470800Z T: int, 2025-05-07T20:32:34.2471000Z D: int, 2025-05-07T20:32:34.2471287Z scale_ub: Optional[float], 2025-05-07T20:32:34.2471569Z contiguous: bool, 2025-05-07T20:32:34.2471813Z compiled: bool, 2025-05-07T20:32:34.2472051Z ) -> None: 2025-05-07T20:32:34.2472278Z torch.manual_seed(2025) 2025-05-07T20:32:34.2472523Z 2025-05-07T20:32:34.2472808Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.2473163Z 2025-05-07T20:32:34.2473364Z x_sign = torch.sign(x) 2025-05-07T20:32:34.2473669Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.2473989Z x = x_sign * x_clamp 2025-05-07T20:32:34.2474303Z x0 = x[:, :D] 2025-05-07T20:32:34.2474531Z x1 = x[:, D:] 2025-05-07T20:32:34.2474750Z 2025-05-07T20:32:34.2474944Z if contiguous: 2025-05-07T20:32:34.2475186Z x0 = x0.contiguous() 2025-05-07T20:32:34.2475457Z x1 = x1.contiguous() 2025-05-07T20:32:34.2475708Z 2025-05-07T20:32:34.2475911Z if scale_ub is not None: 2025-05-07T20:32:34.2476201Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.2476551Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.2476863Z ) 2025-05-07T20:32:34.2477067Z else: 2025-05-07T20:32:34.2477290Z scale_ub_tensor = None 2025-05-07T20:32:34.2477551Z 2025-05-07T20:32:34.2477796Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.2478132Z op = silu_mul_quant 2025-05-07T20:32:34.2478389Z if compiled: 2025-05-07T20:32:34.2478650Z op = torch.compile(op) 2025-05-07T20:32:34.2478962Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.2479242Z 2025-05-07T20:32:34.2479447Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.2479616Z 2025-05-07T20:32:34.2479725Z moe/activation_test.py:117: 2025-05-07T20:32:34.2480029Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.2480366Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.2480661Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.2481367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.2482077Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.2482682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.2483383Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.2484064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.2484610Z kernel = self.compile( 2025-05-07T20:32:34.2485170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.2485845Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.2486251Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.2486486Z 2025-05-07T20:32:34.2486700Z self = 2025-05-07T20:32:34.2487856Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.2489272Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad76a050>} 2025-05-07T20:32:34.2490699Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.2491785Z context = 2025-05-07T20:32:34.2492086Z 2025-05-07T20:32:34.2492259Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.2492795Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.2493279Z module_map=module_map) 2025-05-07T20:32:34.2493649Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.2494017Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.2494288Z E ^ 2025-05-07T20:32:34.2494762Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.2495275Z 2025-05-07T20:32:34.2495703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.2496234Z 2025-05-07T20:32:34.2496345Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.2496780Z self=, 2025-05-07T20:32:34.2497189Z T=16384, 2025-05-07T20:32:34.2497401Z D=7168, 2025-05-07T20:32:34.2497615Z scale_ub=1200.0, 2025-05-07T20:32:34.2497849Z contiguous=False, 2025-05-07T20:32:34.2498190Z compiled=True, 2025-05-07T20:32:34.5117568Z ) 2025-05-07T20:32:34.5118144Z self = 2025-05-07T20:32:34.5118843Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:34.5119130Z 2025-05-07T20:32:34.5119221Z @given( 2025-05-07T20:32:34.5119466Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.5119798Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.5120124Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.5120458Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.5120804Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.5121108Z ) 2025-05-07T20:32:34.5121473Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.5121923Z def test_silu_mul_quant( 2025-05-07T20:32:34.5122173Z self, 2025-05-07T20:32:34.5122378Z T: int, 2025-05-07T20:32:34.5122578Z D: int, 2025-05-07T20:32:34.5122806Z scale_ub: Optional[float], 2025-05-07T20:32:34.5123092Z contiguous: bool, 2025-05-07T20:32:34.5123568Z compiled: bool, 2025-05-07T20:32:34.5123809Z ) -> None: 2025-05-07T20:32:34.5124033Z torch.manual_seed(2025) 2025-05-07T20:32:34.5124279Z 2025-05-07T20:32:34.5124564Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.5124922Z 2025-05-07T20:32:34.5125118Z x_sign = torch.sign(x) 2025-05-07T20:32:34.5125417Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.5125733Z x = x_sign * x_clamp 2025-05-07T20:32:34.5125977Z x0 = x[:, :D] 2025-05-07T20:32:34.5126204Z x1 = x[:, D:] 2025-05-07T20:32:34.5126417Z 2025-05-07T20:32:34.5126611Z if contiguous: 2025-05-07T20:32:34.5126847Z x0 = x0.contiguous() 2025-05-07T20:32:34.5127120Z x1 = x1.contiguous() 2025-05-07T20:32:34.5127368Z 2025-05-07T20:32:34.5127564Z if scale_ub is not None: 2025-05-07T20:32:34.5127851Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.5128229Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.5128572Z ) 2025-05-07T20:32:34.5128777Z else: 2025-05-07T20:32:34.5128998Z scale_ub_tensor = None 2025-05-07T20:32:34.5129252Z 2025-05-07T20:32:34.5129581Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.5129905Z op = silu_mul_quant 2025-05-07T20:32:34.5130157Z if compiled: 2025-05-07T20:32:34.5130416Z op = torch.compile(op) 2025-05-07T20:32:34.5130804Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.5131089Z 2025-05-07T20:32:34.5131294Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.5131469Z 2025-05-07T20:32:34.5131573Z moe/activation_test.py:117: 2025-05-07T20:32:34.5131876Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.5132209Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.5132501Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.5133086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:34.5133661Z return fn(*args, **kwargs) 
2025-05-07T20:32:34.5134434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.5135146Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.5135702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.5136401Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.5137095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.5137644Z kernel = self.compile( 2025-05-07T20:32:34.5138343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.5139049Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.5139458Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.5139691Z 2025-05-07T20:32:34.5139915Z self = 2025-05-07T20:32:34.5141021Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.5142451Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad76b490>} 2025-05-07T20:32:34.5143888Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.5144944Z context = 2025-05-07T20:32:34.5145240Z 2025-05-07T20:32:34.5145421Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.5145957Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.5146440Z module_map=module_map) 2025-05-07T20:32:34.5146814Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.5147178Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.5147447Z E ^ 2025-05-07T20:32:34.5147928Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.5148387Z 2025-05-07T20:32:34.5148817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.5149340Z 2025-05-07T20:32:34.5149454Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.5149885Z self=, 2025-05-07T20:32:34.5150300Z T=1, 2025-05-07T20:32:34.5150492Z D=7168, 2025-05-07T20:32:34.5150698Z scale_ub=None, 2025-05-07T20:32:34.5150974Z contiguous=False, 2025-05-07T20:32:34.5151210Z compiled=False, 2025-05-07T20:32:34.5151431Z ) 2025-05-07T20:32:34.5151764Z self = 2025-05-07T20:32:34.5152308Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:34.5152575Z 2025-05-07T20:32:34.5152657Z @given( 2025-05-07T20:32:34.5152901Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.5153225Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.5153538Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.5153882Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.5154228Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.5154519Z ) 2025-05-07T20:32:34.5154880Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.5155415Z def test_silu_mul_quant( 2025-05-07T20:32:34.5155985Z self, 2025-05-07T20:32:34.5156185Z T: int, 2025-05-07T20:32:34.5156394Z D: int, 2025-05-07T20:32:34.5156623Z scale_ub: Optional[float], 2025-05-07T20:32:34.5156907Z contiguous: bool, 2025-05-07T20:32:34.5157161Z compiled: bool, 2025-05-07T20:32:34.5157395Z ) -> None: 2025-05-07T20:32:34.5157616Z torch.manual_seed(2025) 2025-05-07T20:32:34.5157873Z 2025-05-07T20:32:34.5158202Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.5158558Z 2025-05-07T20:32:34.5158763Z x_sign = torch.sign(x) 2025-05-07T20:32:34.5159066Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.5159382Z x = x_sign * x_clamp 2025-05-07T20:32:34.5159634Z x0 = x[:, :D] 2025-05-07T20:32:34.5159860Z x1 = x[:, D:] 2025-05-07T20:32:34.5160074Z 2025-05-07T20:32:34.5160276Z if contiguous: 2025-05-07T20:32:34.5160518Z x0 = x0.contiguous() 2025-05-07T20:32:34.5160784Z x1 = x1.contiguous() 2025-05-07T20:32:34.5161037Z 2025-05-07T20:32:34.5161240Z if scale_ub is not None: 2025-05-07T20:32:34.5161524Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.5161867Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.5162185Z ) 2025-05-07T20:32:34.5162386Z else: 2025-05-07T20:32:34.5162602Z scale_ub_tensor = None 2025-05-07T20:32:34.5162863Z 2025-05-07T20:32:34.5163106Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.5163424Z op = silu_mul_quant 2025-05-07T20:32:34.5163683Z if compiled: 2025-05-07T20:32:34.5164020Z op = torch.compile(op) 2025-05-07T20:32:34.5164326Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.5164609Z 2025-05-07T20:32:34.5164815Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.5164985Z 2025-05-07T20:32:34.5165089Z moe/activation_test.py:117: 2025-05-07T20:32:34.5165398Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.5165737Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.5166042Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.5166750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.5167459Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.5168016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.5168784Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.5169461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.5170012Z kernel = self.compile( 2025-05-07T20:32:34.5170647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.5171322Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.5171729Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.5172018Z 2025-05-07T20:32:34.5172233Z self = 2025-05-07T20:32:34.5173342Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.5174752Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad76b7f0>} 2025-05-07T20:32:34.5176132Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.5177249Z context = 2025-05-07T20:32:34.5177545Z 2025-05-07T20:32:34.5177725Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.5178327Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.5178810Z module_map=module_map) 2025-05-07T20:32:34.5179187Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.5179558Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.5179822Z E ^ 2025-05-07T20:32:34.5180304Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.5180775Z 2025-05-07T20:32:34.5181207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.5181732Z 2025-05-07T20:32:34.5181849Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.5182272Z self=, 2025-05-07T20:32:34.5182692Z T=2048, 2025-05-07T20:32:34.5182892Z D=7168, 2025-05-07T20:32:34.5183092Z scale_ub=None, 2025-05-07T20:32:34.5183320Z contiguous=False, 2025-05-07T20:32:34.5183556Z compiled=True, 2025-05-07T20:32:34.5183766Z ) 2025-05-07T20:32:34.6183350Z self = 2025-05-07T20:32:34.6183899Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:34.6184217Z 2025-05-07T20:32:34.6184591Z @given( 2025-05-07T20:32:34.6184837Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.6185162Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.6185478Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.6185825Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.6186165Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.6186451Z ) 2025-05-07T20:32:34.6186812Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.6187272Z def test_silu_mul_quant( 2025-05-07T20:32:34.6187517Z self, 2025-05-07T20:32:34.6187724Z T: int, 2025-05-07T20:32:34.6187951Z D: int, 2025-05-07T20:32:34.6188200Z scale_ub: Optional[float], 2025-05-07T20:32:34.6188484Z contiguous: bool, 2025-05-07T20:32:34.6188731Z compiled: bool, 2025-05-07T20:32:34.6188970Z ) -> None: 2025-05-07T20:32:34.6189194Z torch.manual_seed(2025) 2025-05-07T20:32:34.6189450Z 2025-05-07T20:32:34.6189734Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.6190080Z 2025-05-07T20:32:34.6190286Z x_sign = torch.sign(x) 2025-05-07T20:32:34.6190668Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.6190986Z x = x_sign * x_clamp 2025-05-07T20:32:34.6191236Z x0 = x[:, :D] 2025-05-07T20:32:34.6191483Z x1 = x[:, D:] 2025-05-07T20:32:34.6191703Z 2025-05-07T20:32:34.6191983Z if contiguous: 2025-05-07T20:32:34.6192228Z x0 = x0.contiguous() 2025-05-07T20:32:34.6200768Z x1 = x1.contiguous() 2025-05-07T20:32:34.6201047Z 2025-05-07T20:32:34.6201254Z if scale_ub is not None: 2025-05-07T20:32:34.6201540Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.6201889Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.6202194Z ) 2025-05-07T20:32:34.6202400Z else: 2025-05-07T20:32:34.6202617Z scale_ub_tensor = None 2025-05-07T20:32:34.6202871Z 2025-05-07T20:32:34.6203115Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.6203568Z op = silu_mul_quant 2025-05-07T20:32:34.6203820Z if compiled: 2025-05-07T20:32:34.6204082Z op = torch.compile(op) 2025-05-07T20:32:34.6204391Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.6204662Z 2025-05-07T20:32:34.6204864Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.6205034Z 2025-05-07T20:32:34.6205143Z moe/activation_test.py:117: 2025-05-07T20:32:34.6205440Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.6205775Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.6206066Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.6206640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:34.6207208Z return fn(*args, **kwargs) 
2025-05-07T20:32:34.6207879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.6208637Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.6209181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.6209871Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.6210551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.6211097Z kernel = self.compile( 2025-05-07T20:32:34.6211646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.6212315Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.6212774Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.6213007Z 2025-05-07T20:32:34.6213226Z self = 2025-05-07T20:32:34.6214334Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.6215761Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad24caf0>} 2025-05-07T20:32:34.6217142Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.6218269Z context = 2025-05-07T20:32:34.6218615Z 2025-05-07T20:32:34.6218796Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.6219329Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.6219862Z module_map=module_map) 2025-05-07T20:32:34.6220238Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.6220593Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.6220861Z E ^ 2025-05-07T20:32:34.6221396Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.6221854Z 2025-05-07T20:32:34.6222288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.6222806Z 2025-05-07T20:32:34.6222912Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.6223340Z self=, 2025-05-07T20:32:34.6223751Z T=4096, 2025-05-07T20:32:34.6223938Z D=7168, 2025-05-07T20:32:34.6224139Z scale_ub=None, 2025-05-07T20:32:34.6224413Z contiguous=False, 2025-05-07T20:32:34.6224637Z compiled=True, 2025-05-07T20:32:34.6224850Z ) 2025-05-07T20:32:34.6225176Z self = 2025-05-07T20:32:34.6225669Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:34.6225946Z 2025-05-07T20:32:34.6226025Z @given( 2025-05-07T20:32:34.6226259Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.6226576Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.6226879Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.6227211Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.6227542Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.6227824Z ) 2025-05-07T20:32:34.6228183Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.6228629Z def test_silu_mul_quant( 2025-05-07T20:32:34.6228865Z self, 2025-05-07T20:32:34.6229065Z T: int, 2025-05-07T20:32:34.6229263Z D: int, 2025-05-07T20:32:34.6229482Z scale_ub: Optional[float], 2025-05-07T20:32:34.6229754Z contiguous: bool, 2025-05-07T20:32:34.6229995Z compiled: bool, 2025-05-07T20:32:34.6230216Z ) -> None: 2025-05-07T20:32:34.6230440Z torch.manual_seed(2025) 2025-05-07T20:32:34.6230686Z 2025-05-07T20:32:34.6230957Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.6231306Z 2025-05-07T20:32:34.6231503Z x_sign = torch.sign(x) 2025-05-07T20:32:34.6231801Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.6232109Z x = x_sign * x_clamp 2025-05-07T20:32:34.6232352Z x0 = x[:, :D] 2025-05-07T20:32:34.6232573Z x1 = x[:, D:] 2025-05-07T20:32:34.6232833Z 2025-05-07T20:32:34.6233027Z if contiguous: 2025-05-07T20:32:34.6233263Z x0 = x0.contiguous() 2025-05-07T20:32:34.6233524Z x1 = x1.contiguous() 2025-05-07T20:32:34.6233771Z 2025-05-07T20:32:34.6233969Z if scale_ub is not None: 2025-05-07T20:32:34.6234246Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.6234588Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.6234901Z ) 2025-05-07T20:32:34.6235094Z else: 2025-05-07T20:32:34.6235310Z scale_ub_tensor = None 2025-05-07T20:32:34.6235566Z 2025-05-07T20:32:34.6235798Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.6236117Z op = silu_mul_quant 2025-05-07T20:32:34.6236368Z if compiled: 2025-05-07T20:32:34.6236618Z op = torch.compile(op) 2025-05-07T20:32:34.6236912Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.6237187Z 2025-05-07T20:32:34.6237389Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.6237554Z 2025-05-07T20:32:34.6237656Z moe/activation_test.py:117: 2025-05-07T20:32:34.6237961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.6238391Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.6238676Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.6239246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:34.6239856Z return fn(*args, **kwargs) 
2025-05-07T20:32:34.6240523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.6241217Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.6241761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.6242453Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.6243120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.6243706Z kernel = self.compile( 2025-05-07T20:32:34.6244260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.6244929Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.6245325Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.6245562Z 2025-05-07T20:32:34.6245774Z self = 2025-05-07T20:32:34.6246873Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.6248277Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad24c280>} 2025-05-07T20:32:34.6249698Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.6250741Z context = 2025-05-07T20:32:34.6251040Z 2025-05-07T20:32:34.6251210Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.6251742Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.6252216Z module_map=module_map) 2025-05-07T20:32:34.6252585Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.6252945Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.6253258Z E ^ 2025-05-07T20:32:34.6253725Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.6254191Z 2025-05-07T20:32:34.6254615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.6255132Z 2025-05-07T20:32:34.9664773Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.9665754Z self=, 2025-05-07T20:32:34.9666617Z T=16384, 2025-05-07T20:32:34.9667019Z D=5120, 2025-05-07T20:32:34.9667411Z scale_ub=1200.0, 2025-05-07T20:32:34.9667781Z contiguous=False, 2025-05-07T20:32:34.9668018Z compiled=False, 2025-05-07T20:32:34.9668232Z ) 2025-05-07T20:32:34.9668567Z self = 2025-05-07T20:32:34.9669102Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:34.9669398Z 2025-05-07T20:32:34.9669487Z @given( 2025-05-07T20:32:34.9669730Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.9670066Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.9670649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.9670992Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.9671334Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.9671633Z ) 2025-05-07T20:32:34.9672078Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.9672540Z def test_silu_mul_quant( 2025-05-07T20:32:34.9672799Z self, 2025-05-07T20:32:34.9673001Z T: int, 2025-05-07T20:32:34.9673214Z D: int, 2025-05-07T20:32:34.9673450Z scale_ub: Optional[float], 2025-05-07T20:32:34.9673730Z contiguous: bool, 2025-05-07T20:32:34.9673989Z compiled: bool, 2025-05-07T20:32:34.9674235Z ) -> None: 2025-05-07T20:32:34.9674466Z torch.manual_seed(2025) 2025-05-07T20:32:34.9674715Z 2025-05-07T20:32:34.9675009Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.9675463Z 2025-05-07T20:32:34.9675665Z x_sign = torch.sign(x) 2025-05-07T20:32:34.9675973Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.9676300Z x = x_sign * x_clamp 2025-05-07T20:32:34.9676544Z x0 = x[:, :D] 2025-05-07T20:32:34.9676781Z x1 = x[:, D:] 2025-05-07T20:32:34.9677008Z 2025-05-07T20:32:34.9677206Z if contiguous: 2025-05-07T20:32:34.9677450Z x0 = x0.contiguous() 2025-05-07T20:32:34.9677721Z x1 = x1.contiguous() 2025-05-07T20:32:34.9677969Z 2025-05-07T20:32:34.9678170Z if scale_ub is not None: 2025-05-07T20:32:34.9678454Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.9678795Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.9679120Z ) 2025-05-07T20:32:34.9679330Z else: 2025-05-07T20:32:34.9679544Z scale_ub_tensor = None 2025-05-07T20:32:34.9679808Z 2025-05-07T20:32:34.9680066Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.9680389Z op = silu_mul_quant 2025-05-07T20:32:34.9680652Z if compiled: 2025-05-07T20:32:34.9680914Z op = torch.compile(op) 2025-05-07T20:32:34.9681219Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.9681508Z 2025-05-07T20:32:34.9681714Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.9681885Z 2025-05-07T20:32:34.9681991Z moe/activation_test.py:117: 2025-05-07T20:32:34.9682297Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.9682643Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.9682944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.9683728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:34.9684447Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:34.9685006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:34.9685704Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:34.9686386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:34.9686946Z     kernel = self.compile(
2025-05-07T20:32:34.9687503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:34.9688213Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:34.9688630Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:34.9688858Z 
2025-05-07T20:32:34.9689085Z self =
2025-05-07T20:32:34.9690265Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:34.9691685Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad24ed40>}
2025-05-07T20:32:34.9693095Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:34.9694143Z context =
2025-05-07T20:32:34.9694440Z 
2025-05-07T20:32:34.9694621Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:34.9695152Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:34.9695635Z                            module_map=module_map)
2025-05-07T20:32:34.9696056Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:34.9696436Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:34.9696700Z E       ^
2025-05-07T20:32:34.9697184Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:34.9697647Z 
2025-05-07T20:32:34.9698237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:34.9698759Z 
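This failure, and every further example below, has the same root cause: the kernel requests Triton's fp8e4nv (FP8 E4M3) dtype, which Triton only lowers on GPUs with native E4M3 hardware (compute capability 8.9 / Ada or newer), while the GPU running this job exposes only fp8e4b15 and fp8e5, exactly as the error reports. A minimal sketch of a capability guard that a test suite could use to skip such kernels on unsupported hardware follows; the helper name and the skip wiring are illustrative assumptions, not FBGEMM's actual test setup.

import unittest

import torch


def gpu_supports_fp8e4nv() -> bool:
    # fp8e4nv maps onto native E4M3 FP8, introduced with compute
    # capability 8.9 (Ada) and 9.0 (Hopper); earlier GPUs only offer
    # fp8e4b15 and fp8e5, matching the dtypes listed in the error.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
class SiluMulQuantFp8Test(unittest.TestCase):
    def test_capability_assumption(self) -> None:
        # Placeholder body: the real property-based test would only run
        # here, on hardware where the fp8e4nv kernel can compile at all.
        self.assertGreaterEqual(torch.cuda.get_device_capability(), (8, 9))

With a guard like this, Hypothesis would report a single skip instead of re-compiling and failing the same kernel once per drawn (T, D, scale_ub, contiguous, compiled) combination.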
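For orientation while reading the echoed test body above: silu_mul_quant fuses the gated activation y = silu(x0) * x1 with quantization to FP8, returning the quantized tensor together with its scale. The following eager-mode sketch shows the rough shape of that computation; the rowwise scaling scheme, the E4M3 output format, and the helper name are assumptions made for illustration, not FBGEMM's actual kernel contract.

from typing import Optional, Tuple

import torch

FP8_E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn


def silu_mul_quant_reference(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Gated activation: y = silu(x0) * x1, computed in fp32 for accuracy.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # Rowwise absolute maximum, optionally clamped to the scale upper bound.
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.float())
    # Scale each row into the representable FP8 range, then cast.
    y_scale = row_max / FP8_E4M3_MAX
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale

On hardware limited to fp8e5 (E5M2), the final cast could target torch.float8_e5m2 instead, one of the two formats the error message lists as supported on this GPU.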
Hypothesis goes on to draw further examples, and each one fails with the identical echoed test body and traceback shown above (with an extra torch/_dynamo/eval_frame.py:678 frame in _fn when compiled=True), ending in the same triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at triton/compiler/compiler.py:100:

2025-05-07T20:32:34.9698876Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:35.1602372Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:35.1643792Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:35.2701373Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:35.6342462Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:35.6376155Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:35.7557251Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:35.8942594Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:35.8978168Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:36.0920997Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:36.0954645Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:36.2031127Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:36.2045900Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:36.2046066Z 
2025-05-07T20:32:36.2046172Z moe/activation_test.py:117: 
2025-05-07T20:32:36.2046507Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:36.2046839Z moe/activation_test.py:115: in fn
2025-05-07T20:32:36.2047127Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:36.2047695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:36.2048294Z     return fn(*args, **kwargs)
2025-05-07T20:32:36.2048987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.2049688Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.2050276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.2050969Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.2051641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.2052188Z kernel = self.compile( 2025-05-07T20:32:36.2052735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.2053402Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.2053802Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.2054029Z 2025-05-07T20:32:36.2054243Z self = 2025-05-07T20:32:36.2055345Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.2057092Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ac976560>} 2025-05-07T20:32:36.2058595Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.2059638Z context = 2025-05-07T20:32:36.2059931Z 2025-05-07T20:32:36.2060099Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.2060723Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.2061204Z module_map=module_map) 2025-05-07T20:32:36.2061573Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.2061941Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.2062204Z E ^ 2025-05-07T20:32:36.2062680Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.2063138Z 2025-05-07T20:32:36.2063560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.2064083Z 2025-05-07T20:32:36.2899951Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.2900444Z self=, 2025-05-07T20:32:36.2900974Z T=16384, 2025-05-07T20:32:36.2901250Z D=5120, 2025-05-07T20:32:36.2901470Z scale_ub=None, 2025-05-07T20:32:36.2901699Z contiguous=False, 2025-05-07T20:32:36.2901932Z compiled=False, 2025-05-07T20:32:36.2902158Z ) 2025-05-07T20:32:36.2902491Z self = 2025-05-07T20:32:36.2903183Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:36.2903484Z 2025-05-07T20:32:36.2903566Z @given( 2025-05-07T20:32:36.2903812Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.2904215Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.2904525Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.2904866Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.2905206Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.2905493Z ) 2025-05-07T20:32:36.2905856Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.2906313Z def test_silu_mul_quant( 2025-05-07T20:32:36.2906556Z self, 2025-05-07T20:32:36.2906759Z T: int, 2025-05-07T20:32:36.2906960Z D: int, 2025-05-07T20:32:36.2907266Z scale_ub: Optional[float], 2025-05-07T20:32:36.2907545Z contiguous: bool, 2025-05-07T20:32:36.2907794Z compiled: bool, 2025-05-07T20:32:36.2908026Z ) -> None: 2025-05-07T20:32:36.2908246Z torch.manual_seed(2025) 2025-05-07T20:32:36.2908493Z 2025-05-07T20:32:36.2908776Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.2909122Z 2025-05-07T20:32:36.2909323Z x_sign = torch.sign(x) 2025-05-07T20:32:36.2909620Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.2911709Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
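The repeated CompilationError above is Triton refusing to lower the fp8e4nv type (Triton's name for torch.float8_e4m3fn) on this GPU: a g5.4xlarge runner carries an NVIDIA A10G at compute capability 8.6, while Triton only compiles fp8e4nv on compute capability 8.9 or newer, leaving exactly the 'fp8e4b15' and 'fp8e5' variants that the ValueError lists. A minimal capability guard along these lines could skip the test instead of failing it; supports_fp8e4nv is a hypothetical helper, not part of the FBGEMM test suite:

```python
import unittest

import torch


def supports_fp8e4nv() -> bool:
    """True when Triton can lower fp8e4nv (torch.float8_e4m3fn) kernels.

    fp8e4nv requires compute capability >= 8.9 (Ada / Hopper); the A10G
    on this runner reports (8, 6), where only fp8e4b15 / fp8e5 compile.
    """
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on a test such as test_silu_mul_quant:
requires_fp8 = unittest.skipUnless(
    supports_fp8e4nv(), "fp8e4nv needs SM 8.9+; skipping on this GPU"
)
```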
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.2913650Z 2025-05-07T20:32:36.2913778Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:36.2913996Z 2025-05-07T20:32:36.2914102Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.2914530Z self=, 2025-05-07T20:32:36.2914939Z T=4096, 2025-05-07T20:32:36.2915133Z D=7168, 2025-05-07T20:32:36.2915327Z scale_ub=1200.0, 2025-05-07T20:32:36.2915556Z contiguous=True, 2025-05-07T20:32:36.2915783Z compiled=True, 2025-05-07T20:32:36.2915985Z ) 2025-05-07T20:32:36.2916313Z self = 2025-05-07T20:32:36.2916892Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:36.2917169Z 2025-05-07T20:32:36.2917249Z @given( 2025-05-07T20:32:36.2917490Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.2917811Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.2918118Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.2918459Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.2918795Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.2919091Z ) 2025-05-07T20:32:36.2919443Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.2919890Z def test_silu_mul_quant( 2025-05-07T20:32:36.2920136Z self, 2025-05-07T20:32:36.2920332Z T: int, 2025-05-07T20:32:36.2920534Z D: int, 2025-05-07T20:32:36.2920760Z scale_ub: Optional[float], 2025-05-07T20:32:36.2921030Z contiguous: bool, 2025-05-07T20:32:36.2921278Z compiled: bool, 2025-05-07T20:32:36.2921508Z ) -> None: 2025-05-07T20:32:36.2921724Z torch.manual_seed(2025) 2025-05-07T20:32:36.2921972Z 2025-05-07T20:32:36.2922250Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.2922641Z 2025-05-07T20:32:36.2922843Z x_sign = torch.sign(x) 2025-05-07T20:32:36.2923140Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.2925227Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
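The "Tried to allocate" figures in these OutOfMemoryErrors follow directly from the sampled shapes: a [T, 2 * D] bfloat16 tensor takes T * 2 * D * 2 bytes, and torch.abs, torch.clamp, and the x_sign * x_clamp product each materialize another tensor of that same size. Checking the failing examples against the log is plain arithmetic:

```python
def bf16_mib(T: int, D: int) -> float:
    """Size in MiB of one [T, 2 * D] bfloat16 tensor (2 bytes/element)."""
    return T * 2 * D * 2 / 2**20


print(bf16_mib(16384, 5120))  # 320.0 -> "Tried to allocate 320.00 MiB"
print(bf16_mib(4096, 7168))   # 112.0 -> "Tried to allocate 112.00 MiB"
print(bf16_mib(2048, 7168))   # 56.0  -> "Tried to allocate 56.00 MiB"
print(bf16_mib(16384, 7168))  # 448.0 -> "Tried to allocate 448.00 MiB"
```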
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.2927186Z 2025-05-07T20:32:36.2927309Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:36.2927564Z 2025-05-07T20:32:36.2927676Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.2928096Z self=, 2025-05-07T20:32:36.2928507Z T=16384, 2025-05-07T20:32:36.2928705Z D=7168, 2025-05-07T20:32:36.2928896Z scale_ub=None, 2025-05-07T20:32:36.2929120Z contiguous=False, 2025-05-07T20:32:36.2929354Z compiled=False, 2025-05-07T20:32:36.2929559Z ) 2025-05-07T20:32:36.2929887Z self = 2025-05-07T20:32:36.2930391Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:36.2930673Z 2025-05-07T20:32:36.2930755Z @given( 2025-05-07T20:32:36.2930984Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.2931309Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.2931619Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.2931947Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.2932283Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.2932590Z ) 2025-05-07T20:32:36.2932953Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.2933406Z def test_silu_mul_quant( 2025-05-07T20:32:36.2933647Z self, 2025-05-07T20:32:36.2933849Z T: int, 2025-05-07T20:32:36.2934052Z D: int, 2025-05-07T20:32:36.2934270Z scale_ub: Optional[float], 2025-05-07T20:32:36.2934548Z contiguous: bool, 2025-05-07T20:32:36.2934794Z compiled: bool, 2025-05-07T20:32:36.2935016Z ) -> None: 2025-05-07T20:32:36.2935240Z torch.manual_seed(2025) 2025-05-07T20:32:36.2935491Z 2025-05-07T20:32:36.2935764Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.2937924Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.2939948Z 2025-05-07T20:32:36.2940072Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.2940294Z 2025-05-07T20:32:36.2940399Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.2940823Z self=, 2025-05-07T20:32:36.2941229Z T=2048, 2025-05-07T20:32:36.2941420Z D=7168, 2025-05-07T20:32:36.2941618Z scale_ub=1200.0, 2025-05-07T20:32:36.2941840Z contiguous=True, 2025-05-07T20:32:36.2942066Z compiled=True, 2025-05-07T20:32:36.2942272Z ) 2025-05-07T20:32:36.2942595Z self = 2025-05-07T20:32:36.2943143Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:36.2943417Z 2025-05-07T20:32:36.2943500Z @given( 2025-05-07T20:32:36.2943734Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.2944090Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.2944401Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.2944736Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.2945066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.2945358Z ) 2025-05-07T20:32:36.2945720Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.2946164Z def test_silu_mul_quant( 2025-05-07T20:32:36.2946418Z self, 2025-05-07T20:32:36.2946628Z T: int, 2025-05-07T20:32:36.2946822Z D: int, 2025-05-07T20:32:36.2947044Z scale_ub: Optional[float], 2025-05-07T20:32:36.2947368Z contiguous: bool, 2025-05-07T20:32:36.2947610Z compiled: bool, 2025-05-07T20:32:36.2947837Z ) -> None: 2025-05-07T20:32:36.2948077Z torch.manual_seed(2025) 2025-05-07T20:32:36.2948360Z 2025-05-07T20:32:36.2948632Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.2948986Z 2025-05-07T20:32:36.2949185Z x_sign = torch.sign(x) 2025-05-07T20:32:36.2949476Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.2951527Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.2953429Z 2025-05-07T20:32:36.2953550Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:36.2953766Z 2025-05-07T20:32:36.2953877Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.2954306Z self=, 2025-05-07T20:32:36.2954709Z T=2048, 2025-05-07T20:32:36.2954904Z D=7168, 2025-05-07T20:32:36.2955104Z scale_ub=None, 2025-05-07T20:32:36.2955320Z contiguous=True, 2025-05-07T20:32:36.2955841Z compiled=False, 2025-05-07T20:32:36.2956124Z ) 2025-05-07T20:32:36.5960352Z self = 2025-05-07T20:32:36.5961353Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.5961647Z 2025-05-07T20:32:36.5961731Z @given( 2025-05-07T20:32:36.5961973Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.5962298Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.5962625Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.5962974Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.5963318Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.5963613Z ) 2025-05-07T20:32:36.5963976Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.5964429Z def test_silu_mul_quant( 2025-05-07T20:32:36.5964678Z self, 2025-05-07T20:32:36.5964883Z T: int, 2025-05-07T20:32:36.5965086Z D: int, 2025-05-07T20:32:36.5965311Z scale_ub: Optional[float], 2025-05-07T20:32:36.5965593Z contiguous: bool, 2025-05-07T20:32:36.5966047Z compiled: bool, 2025-05-07T20:32:36.5966280Z ) -> None: 2025-05-07T20:32:36.5966507Z torch.manual_seed(2025) 2025-05-07T20:32:36.5966757Z 2025-05-07T20:32:36.5967041Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.5967400Z 2025-05-07T20:32:36.5967684Z > x_sign = torch.sign(x) 2025-05-07T20:32:36.5969713Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
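Each of these examples dies on its very first allocations (torch.randn, torch.sign, torch.clamp) because roughly 21.9 to 22.0 GiB of the device's 22.07 GiB is already in use, so the failures are cumulative rather than per-example: memory held by earlier examples and their torch.compile artifacts is never returned to the pool between Hypothesis draws. One plausible mitigation, assuming nothing about the suite's own fixtures, is a best-effort cleanup between examples:

```python
import gc

import torch


def release_cuda_memory() -> None:
    """Best-effort cleanup between property-based test examples.

    Drops dead Python references first, then has the caching allocator
    return unused blocks so the next example starts against a mostly
    empty pool.
    """
    gc.collect()              # free tensors kept alive only by cycles
    torch.cuda.synchronize()  # make sure pending kernels have finished
    torch.cuda.empty_cache()  # release cached blocks back to the driver
```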
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.5971725Z 2025-05-07T20:32:36.5971859Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:36.5972079Z 2025-05-07T20:32:36.5972187Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.5972616Z self=, 2025-05-07T20:32:36.5973140Z T=1, 2025-05-07T20:32:36.5973333Z D=7168, 2025-05-07T20:32:36.5973536Z scale_ub=1200.0, 2025-05-07T20:32:36.5973768Z contiguous=True, 2025-05-07T20:32:36.5973995Z compiled=False, 2025-05-07T20:32:36.5974213Z ) 2025-05-07T20:32:36.5974547Z self = 2025-05-07T20:32:36.5975048Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:36.5975317Z 2025-05-07T20:32:36.5975398Z @given( 2025-05-07T20:32:36.5975636Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.5975961Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.5976273Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.5976618Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.5976964Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.5977258Z ) 2025-05-07T20:32:36.5977623Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.5978175Z def test_silu_mul_quant( 2025-05-07T20:32:36.5978433Z self, 2025-05-07T20:32:36.5978663Z T: int, 2025-05-07T20:32:36.5978888Z D: int, 2025-05-07T20:32:36.5979118Z scale_ub: Optional[float], 2025-05-07T20:32:36.5979399Z contiguous: bool, 2025-05-07T20:32:36.5979650Z compiled: bool, 2025-05-07T20:32:36.5979886Z ) -> None: 2025-05-07T20:32:36.5980108Z torch.manual_seed(2025) 2025-05-07T20:32:36.5980359Z 2025-05-07T20:32:36.5980645Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.5980993Z 2025-05-07T20:32:36.5981197Z x_sign = torch.sign(x) 2025-05-07T20:32:36.5981547Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.5981861Z x = x_sign * x_clamp 2025-05-07T20:32:36.5982113Z x0 = x[:, :D] 2025-05-07T20:32:36.5982341Z x1 = x[:, D:] 2025-05-07T20:32:36.5982550Z 2025-05-07T20:32:36.5982750Z if contiguous: 2025-05-07T20:32:36.5982995Z x0 = x0.contiguous() 2025-05-07T20:32:36.5983259Z x1 = x1.contiguous() 2025-05-07T20:32:36.5983508Z 2025-05-07T20:32:36.5983713Z if scale_ub is not None: 2025-05-07T20:32:36.5984002Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.5984349Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.5984682Z ) 2025-05-07T20:32:36.5984886Z else: 2025-05-07T20:32:36.5985102Z scale_ub_tensor = None 2025-05-07T20:32:36.5985365Z 2025-05-07T20:32:36.5985610Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.5985929Z op = silu_mul_quant 2025-05-07T20:32:36.5986194Z if compiled: 2025-05-07T20:32:36.5986450Z op = torch.compile(op) 2025-05-07T20:32:36.5986751Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.5987039Z 2025-05-07T20:32:36.5987249Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.5987419Z 2025-05-07T20:32:36.5987571Z moe/activation_test.py:117: 2025-05-07T20:32:36.5987877Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.5988219Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.5988552Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.5989261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.5989973Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.5990529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.5991231Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.5991914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.5992508Z kernel = self.compile( 2025-05-07T20:32:36.5993072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.5993743Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.5994155Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.5994383Z 2025-05-07T20:32:36.5994607Z self = 2025-05-07T20:32:36.5995715Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.5997123Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ac6884c0>} 2025-05-07T20:32:36.5998510Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.5999565Z context = 2025-05-07T20:32:36.5999864Z 2025-05-07T20:32:36.6000042Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.6000575Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.6001062Z module_map=module_map) 2025-05-07T20:32:36.6001437Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.6001802Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.6002114Z E ^ 2025-05-07T20:32:36.6002595Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.6003055Z 2025-05-07T20:32:36.6003488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.6004009Z 2025-05-07T20:32:36.6004125Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.6004548Z self=, 2025-05-07T20:32:36.6004967Z T=128, 2025-05-07T20:32:36.6005168Z D=5120, 2025-05-07T20:32:36.6005367Z scale_ub=None, 2025-05-07T20:32:36.6005595Z contiguous=True, 2025-05-07T20:32:36.6005832Z compiled=False, 2025-05-07T20:32:36.6006043Z ) 2025-05-07T20:32:36.6783690Z self = 2025-05-07T20:32:36.6784500Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.6784894Z 2025-05-07T20:32:36.6785007Z @given( 2025-05-07T20:32:36.6785346Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.6785688Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.6786246Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.6786600Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.6786944Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.6787238Z ) 2025-05-07T20:32:36.6787677Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.6788133Z def test_silu_mul_quant( 2025-05-07T20:32:36.6788388Z self, 2025-05-07T20:32:36.6788598Z T: int, 2025-05-07T20:32:36.6788809Z D: int, 2025-05-07T20:32:36.6789036Z scale_ub: Optional[float], 2025-05-07T20:32:36.6789323Z contiguous: bool, 2025-05-07T20:32:36.6789592Z compiled: bool, 2025-05-07T20:32:36.6789832Z ) -> None: 2025-05-07T20:32:36.6790063Z torch.manual_seed(2025) 2025-05-07T20:32:36.6790319Z 2025-05-07T20:32:36.6790604Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.6799346Z 2025-05-07T20:32:36.6799605Z x_sign = torch.sign(x) 2025-05-07T20:32:36.6799927Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.6800253Z x = x_sign * x_clamp 2025-05-07T20:32:36.6800503Z x0 = x[:, :D] 2025-05-07T20:32:36.6800719Z x1 = x[:, D:] 2025-05-07T20:32:36.6800947Z 2025-05-07T20:32:36.6801147Z if contiguous: 2025-05-07T20:32:36.6801383Z x0 = x0.contiguous() 2025-05-07T20:32:36.6801652Z x1 = x1.contiguous() 2025-05-07T20:32:36.6801901Z 2025-05-07T20:32:36.6802104Z if scale_ub is not None: 2025-05-07T20:32:36.6802385Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.6802741Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.6803074Z ) 2025-05-07T20:32:36.6803275Z else: 2025-05-07T20:32:36.6803502Z scale_ub_tensor = None 2025-05-07T20:32:36.6803768Z 2025-05-07T20:32:36.6804009Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.6804345Z op = silu_mul_quant 2025-05-07T20:32:36.6804611Z if compiled: 2025-05-07T20:32:36.6804865Z op = torch.compile(op) 2025-05-07T20:32:36.6805177Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.6805463Z 2025-05-07T20:32:36.6805660Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.6805838Z 2025-05-07T20:32:36.6805944Z moe/activation_test.py:117: 2025-05-07T20:32:36.6806257Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.6806602Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.6806894Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.6807733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.6808450Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.6809000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.6809702Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.6810386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.6810938Z kernel = self.compile( 2025-05-07T20:32:36.6811491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.6812169Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.6812581Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.6812811Z 2025-05-07T20:32:36.6813037Z self = 2025-05-07T20:32:36.6814193Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.6815622Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ac688940>} 2025-05-07T20:32:36.6817046Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.6818216Z context = 2025-05-07T20:32:36.6818516Z 2025-05-07T20:32:36.6818689Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.6819232Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.6819720Z module_map=module_map) 2025-05-07T20:32:36.6820149Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.6820516Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.6820790Z E ^ 2025-05-07T20:32:36.6821272Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.6821735Z 2025-05-07T20:32:36.6822161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.6822692Z 2025-05-07T20:32:36.6822800Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.6823232Z self=, 2025-05-07T20:32:36.6823642Z T=128, 2025-05-07T20:32:36.6823832Z D=7168, 2025-05-07T20:32:36.6824037Z scale_ub=None, 2025-05-07T20:32:36.6824257Z contiguous=True, 2025-05-07T20:32:36.6824486Z compiled=False, 2025-05-07T20:32:36.6824705Z ) 2025-05-07T20:32:36.6825031Z self = 2025-05-07T20:32:36.6825537Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.6825807Z 2025-05-07T20:32:36.6825890Z @given( 2025-05-07T20:32:36.6826127Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.6826451Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.6826768Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.6827106Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.6827439Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.6827734Z ) 2025-05-07T20:32:36.6828096Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.6828588Z def test_silu_mul_quant( 2025-05-07T20:32:36.6828879Z self, 2025-05-07T20:32:36.6829087Z T: int, 2025-05-07T20:32:36.6829281Z D: int, 2025-05-07T20:32:36.6829510Z scale_ub: Optional[float], 2025-05-07T20:32:36.6829789Z contiguous: bool, 2025-05-07T20:32:36.6830035Z compiled: bool, 2025-05-07T20:32:36.6830271Z ) -> None: 2025-05-07T20:32:36.6830494Z torch.manual_seed(2025) 2025-05-07T20:32:36.6830739Z 2025-05-07T20:32:36.6831029Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.6831384Z 2025-05-07T20:32:36.6831586Z x_sign = torch.sign(x) 2025-05-07T20:32:36.6831883Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.6832199Z x = x_sign * x_clamp 2025-05-07T20:32:36.6832450Z x0 = x[:, :D] 2025-05-07T20:32:36.6832672Z x1 = x[:, D:] 2025-05-07T20:32:36.6832886Z 2025-05-07T20:32:36.6833084Z if contiguous: 2025-05-07T20:32:36.6833321Z x0 = x0.contiguous() 2025-05-07T20:32:36.6833591Z x1 = x1.contiguous() 2025-05-07T20:32:36.6833841Z 2025-05-07T20:32:36.6834035Z if scale_ub is not None: 2025-05-07T20:32:36.6834318Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.6834712Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.6835015Z ) 2025-05-07T20:32:36.6835210Z else: 2025-05-07T20:32:36.6835425Z scale_ub_tensor = None 2025-05-07T20:32:36.6835675Z 2025-05-07T20:32:36.6835955Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.6836274Z op = silu_mul_quant 2025-05-07T20:32:36.6836530Z if compiled: 2025-05-07T20:32:36.6836778Z op = torch.compile(op) 2025-05-07T20:32:36.6837085Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.6837368Z 2025-05-07T20:32:36.6837568Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.6837749Z 2025-05-07T20:32:36.6837855Z moe/activation_test.py:117: 2025-05-07T20:32:36.6838161Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.6838491Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.6838850Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.6839585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.6840293Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.6840835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.6841536Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.6842217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.6842759Z kernel = self.compile( 2025-05-07T20:32:36.6843317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.6843989Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.6844395Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.6844623Z 2025-05-07T20:32:36.6844840Z self = 2025-05-07T20:32:36.6845944Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.6847346Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ac689240>} 2025-05-07T20:32:36.6848764Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.6849816Z context = 2025-05-07T20:32:36.6850115Z 2025-05-07T20:32:36.6850288Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.6850823Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.6851305Z module_map=module_map) 2025-05-07T20:32:36.6851677Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.6852039Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.6852308Z E ^ 2025-05-07T20:32:36.6852778Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.6853242Z 2025-05-07T20:32:36.6853671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.6854198Z 2025-05-07T20:32:36.6854305Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.6854733Z self=, 2025-05-07T20:32:36.6855162Z T=2048, 2025-05-07T20:32:36.6855396Z D=7168, 2025-05-07T20:32:36.6855927Z scale_ub=1200.0, 2025-05-07T20:32:36.6856162Z contiguous=True, 2025-05-07T20:32:36.6856384Z compiled=False, 2025-05-07T20:32:36.6856594Z ) 2025-05-07T20:32:36.7806307Z self = 2025-05-07T20:32:36.7807136Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:36.7807534Z 2025-05-07T20:32:36.7807650Z @given( 2025-05-07T20:32:36.7807905Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.7808234Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.7808600Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.7808964Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.7809300Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.7809757Z ) 2025-05-07T20:32:36.7810126Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.7810592Z def test_silu_mul_quant( 2025-05-07T20:32:36.7810840Z self, 2025-05-07T20:32:36.7811048Z T: int, 2025-05-07T20:32:36.7811257Z D: int, 2025-05-07T20:32:36.7811483Z scale_ub: Optional[float], 2025-05-07T20:32:36.7811779Z contiguous: bool, 2025-05-07T20:32:36.7812032Z compiled: bool, 2025-05-07T20:32:36.7812265Z ) -> None: 2025-05-07T20:32:36.7812499Z torch.manual_seed(2025) 2025-05-07T20:32:36.7812757Z 2025-05-07T20:32:36.7813040Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.7815186Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.7817118Z 2025-05-07T20:32:36.7817250Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.7817477Z 2025-05-07T20:32:36.7817587Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.7818111Z self=, 2025-05-07T20:32:36.7818521Z T=1, 2025-05-07T20:32:36.7818721Z D=5120, 2025-05-07T20:32:36.7818922Z scale_ub=1200.0, 2025-05-07T20:32:36.7819150Z contiguous=True, 2025-05-07T20:32:36.7819380Z compiled=False, 2025-05-07T20:32:36.7819694Z ) 2025-05-07T20:32:36.7820022Z self = 2025-05-07T20:32:36.7820526Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:36.7820799Z 2025-05-07T20:32:36.7820886Z @given( 2025-05-07T20:32:36.7821129Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.7821447Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.7821764Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.7822113Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.7822450Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.7822763Z ) 2025-05-07T20:32:36.7823131Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.7823587Z def test_silu_mul_quant( 2025-05-07T20:32:36.7823841Z self, 2025-05-07T20:32:36.7824046Z T: int, 2025-05-07T20:32:36.7824252Z D: int, 2025-05-07T20:32:36.7824482Z scale_ub: Optional[float], 2025-05-07T20:32:36.7824769Z contiguous: bool, 2025-05-07T20:32:36.7825016Z compiled: bool, 2025-05-07T20:32:36.7825252Z ) -> None: 2025-05-07T20:32:36.7825484Z torch.manual_seed(2025) 2025-05-07T20:32:36.7825809Z 2025-05-07T20:32:36.7826100Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.7826455Z 2025-05-07T20:32:36.7826657Z x_sign = torch.sign(x) 2025-05-07T20:32:36.7827063Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.7827384Z x = x_sign * x_clamp 2025-05-07T20:32:36.7827634Z x0 = x[:, :D] 2025-05-07T20:32:36.7827863Z x1 = x[:, D:] 2025-05-07T20:32:36.7828084Z 2025-05-07T20:32:36.7828300Z if contiguous: 2025-05-07T20:32:36.7828573Z x0 = x0.contiguous() 2025-05-07T20:32:36.7828845Z x1 = x1.contiguous() 2025-05-07T20:32:36.7829097Z 2025-05-07T20:32:36.7829298Z if scale_ub is not None: 2025-05-07T20:32:36.7829585Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.7829936Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.7830301Z ) 2025-05-07T20:32:36.7830504Z else: 2025-05-07T20:32:36.7830728Z scale_ub_tensor = None 2025-05-07T20:32:36.7830985Z 2025-05-07T20:32:36.7831231Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.7831562Z op = silu_mul_quant 2025-05-07T20:32:36.7831819Z if compiled: 2025-05-07T20:32:36.7832081Z op = torch.compile(op) 2025-05-07T20:32:36.7832393Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.7832673Z 2025-05-07T20:32:36.7832877Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.7833055Z 2025-05-07T20:32:36.7833161Z moe/activation_test.py:117: 2025-05-07T20:32:36.7833467Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.7833808Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.7834105Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.7834821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.7835532Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.7836090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.7836790Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.7837469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.7838015Z kernel = self.compile( 2025-05-07T20:32:36.7838575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.7839351Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.7839756Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.7839994Z 2025-05-07T20:32:36.7840209Z self = 2025-05-07T20:32:36.7841318Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.7842729Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ac68a200>} 2025-05-07T20:32:36.7844106Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.7845157Z context = 2025-05-07T20:32:36.7845459Z 2025-05-07T20:32:36.7845630Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.7846219Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.7846712Z module_map=module_map) 2025-05-07T20:32:36.7847083Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.7847452Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.7847764Z E ^ 2025-05-07T20:32:36.7848232Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.7848698Z 2025-05-07T20:32:36.7849123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.7849652Z 2025-05-07T20:32:36.7849762Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.7850193Z self=, 2025-05-07T20:32:36.7850600Z T=2048, 2025-05-07T20:32:36.7850841Z D=5120, 2025-05-07T20:32:36.7851043Z scale_ub=None, 2025-05-07T20:32:36.7851261Z contiguous=True, 2025-05-07T20:32:36.7851500Z compiled=False, 2025-05-07T20:32:36.7851713Z ) 2025-05-07T20:32:36.7852037Z self = 2025-05-07T20:32:36.7852542Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.7852826Z 2025-05-07T20:32:36.7852908Z @given( 2025-05-07T20:32:36.7853151Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.7853470Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.7853785Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.7854123Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.7854459Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.7854751Z ) 2025-05-07T20:32:36.7855111Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.7855847Z def test_silu_mul_quant( 2025-05-07T20:32:36.7856102Z self, 2025-05-07T20:32:36.7856305Z T: int, 2025-05-07T20:32:36.7856503Z D: int, 2025-05-07T20:32:36.7856729Z scale_ub: Optional[float], 2025-05-07T20:32:36.7857008Z contiguous: bool, 2025-05-07T20:32:36.7857251Z compiled: bool, 2025-05-07T20:32:36.7857485Z ) -> None: 2025-05-07T20:32:36.7857709Z torch.manual_seed(2025) 2025-05-07T20:32:36.7857959Z 2025-05-07T20:32:36.7858314Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.7858668Z 2025-05-07T20:32:36.7858900Z > x_sign = torch.sign(x) 2025-05-07T20:32:36.7860989Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
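Each "Trying example: test_silu_mul_quant(...)" block is Hypothesis drawing a fresh combination from the sampled_from strategies and re-running the test; the attempts are printed only because the test opts into Verbosity.verbose in its @settings. Once a failure like the ones above is known, one way to pin a specific draw from this log as a permanent regression case is @example, sketched here on a stripped-down test (the body is illustrative, not the FBGEMM test):

```python
from hypothesis import example, given, settings, strategies as st


@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
)
@example(T=128, D=7168)  # the first failing draw reported above
@settings(deadline=None)
def test_shapes_are_positive(T: int, D: int) -> None:
    # Stand-in property; the real test builds [T, 2 * D] tensors.
    assert T * 2 * D > 0
```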
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.7862912Z 2025-05-07T20:32:36.7863037Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:36.7863263Z 2025-05-07T20:32:36.7863370Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.7863797Z self=, 2025-05-07T20:32:36.7864209Z T=16384, 2025-05-07T20:32:36.7864405Z D=5120, 2025-05-07T20:32:36.7864605Z scale_ub=None, 2025-05-07T20:32:36.7864827Z contiguous=True, 2025-05-07T20:32:36.7865053Z compiled=False, 2025-05-07T20:32:36.7865266Z ) 2025-05-07T20:32:36.8837863Z self = 2025-05-07T20:32:36.8838635Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.8839036Z 2025-05-07T20:32:36.8839146Z @given( 2025-05-07T20:32:36.8839676Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.8840136Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.8840448Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.8840872Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.8841209Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.8841500Z ) 2025-05-07T20:32:36.8841852Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.8842301Z def test_silu_mul_quant( 2025-05-07T20:32:36.8842551Z self, 2025-05-07T20:32:36.8842746Z T: int, 2025-05-07T20:32:36.8842950Z D: int, 2025-05-07T20:32:36.8843178Z scale_ub: Optional[float], 2025-05-07T20:32:36.8843453Z contiguous: bool, 2025-05-07T20:32:36.8843700Z compiled: bool, 2025-05-07T20:32:36.8844012Z ) -> None: 2025-05-07T20:32:36.8844230Z torch.manual_seed(2025) 2025-05-07T20:32:36.8844480Z 2025-05-07T20:32:36.8844766Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.8846886Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
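The allocator hint appended to every OutOfMemoryError deserves a caveat here: with only about 19 to 141 MiB "reserved by PyTorch but unallocated", these failures look like genuine exhaustion rather than fragmentation, so expandable segments alone may not rescue the job. If it is tried anyway, PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator initializes, so it must be set before the first tensor reaches the GPU, e.g. in the workflow's env block or at interpreter startup:

```python
import os

# Must be in place before CUDA is initialized; setting it after the
# first allocation has no effect on the already-created allocator.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # noqa: E402  (deliberately imported after the env var)

x = torch.zeros(1, device="cuda")  # first allocation sees the setting
```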
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.8848810Z 2025-05-07T20:32:36.8848942Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.8849161Z 2025-05-07T20:32:36.8849267Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.8849699Z self=, 2025-05-07T20:32:36.8850115Z T=4096, 2025-05-07T20:32:36.8850302Z D=5120, 2025-05-07T20:32:36.8850504Z scale_ub=None, 2025-05-07T20:32:36.8850727Z contiguous=True, 2025-05-07T20:32:36.8850951Z compiled=False, 2025-05-07T20:32:36.8851167Z ) 2025-05-07T20:32:36.8851497Z self = 2025-05-07T20:32:36.8851994Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.8852276Z 2025-05-07T20:32:36.8852354Z @given( 2025-05-07T20:32:36.8852592Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.8852911Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.8853295Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.8853639Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.8853977Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.8854271Z ) 2025-05-07T20:32:36.8854635Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.8855091Z def test_silu_mul_quant( 2025-05-07T20:32:36.8855338Z self, 2025-05-07T20:32:36.8855907Z T: int, 2025-05-07T20:32:36.8856123Z D: int, 2025-05-07T20:32:36.8856346Z scale_ub: Optional[float], 2025-05-07T20:32:36.8856626Z contiguous: bool, 2025-05-07T20:32:36.8856873Z compiled: bool, 2025-05-07T20:32:36.8857105Z ) -> None: 2025-05-07T20:32:36.8857324Z torch.manual_seed(2025) 2025-05-07T20:32:36.8857576Z 2025-05-07T20:32:36.8857856Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.8860202Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.8862169Z 2025-05-07T20:32:36.8862294Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.8862515Z 2025-05-07T20:32:36.8862621Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.8863047Z self=, 2025-05-07T20:32:36.8863455Z T=2048, 2025-05-07T20:32:36.8863641Z D=5120, 2025-05-07T20:32:36.8863839Z scale_ub=None, 2025-05-07T20:32:36.8864060Z contiguous=False, 2025-05-07T20:32:36.8864291Z compiled=False, 2025-05-07T20:32:36.8864501Z ) 2025-05-07T20:32:36.8864825Z self = 2025-05-07T20:32:36.8865391Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:36.8865673Z 2025-05-07T20:32:36.8865757Z @given( 2025-05-07T20:32:36.8866010Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.8866331Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.8866651Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.8866987Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.8867323Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.8867610Z ) 2025-05-07T20:32:36.8867967Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.8868416Z def test_silu_mul_quant( 2025-05-07T20:32:36.8868657Z self, 2025-05-07T20:32:36.8868863Z T: int, 2025-05-07T20:32:36.8869064Z D: int, 2025-05-07T20:32:36.8869281Z scale_ub: Optional[float], 2025-05-07T20:32:36.8869559Z contiguous: bool, 2025-05-07T20:32:36.8869808Z compiled: bool, 2025-05-07T20:32:36.8870032Z ) -> None: 2025-05-07T20:32:36.8870258Z torch.manual_seed(2025) 2025-05-07T20:32:36.8870506Z 2025-05-07T20:32:36.8878477Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.8880831Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.8882746Z 2025-05-07T20:32:36.8882871Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.8883097Z 2025-05-07T20:32:36.8883203Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.8883629Z self=, 2025-05-07T20:32:36.8884027Z T=4096, 2025-05-07T20:32:36.8884218Z D=7168, 2025-05-07T20:32:36.8884413Z scale_ub=None, 2025-05-07T20:32:36.8884632Z contiguous=True, 2025-05-07T20:32:36.8884854Z compiled=True, 2025-05-07T20:32:36.8885059Z ) 2025-05-07T20:32:36.8885383Z self = 2025-05-07T20:32:36.8885872Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:36.8886146Z 2025-05-07T20:32:36.8886229Z @given( 2025-05-07T20:32:36.8886465Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.8886782Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.8887100Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.8887438Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.8887768Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.8888059Z ) 2025-05-07T20:32:36.8888497Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.8888976Z def test_silu_mul_quant( 2025-05-07T20:32:36.8889216Z self, 2025-05-07T20:32:36.8889463Z T: int, 2025-05-07T20:32:36.8889666Z D: int, 2025-05-07T20:32:36.8889884Z scale_ub: Optional[float], 2025-05-07T20:32:36.8890163Z contiguous: bool, 2025-05-07T20:32:36.8890411Z compiled: bool, 2025-05-07T20:32:36.8890634Z ) -> None: 2025-05-07T20:32:36.8890856Z torch.manual_seed(2025) 2025-05-07T20:32:36.8891109Z 2025-05-07T20:32:36.8891419Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.8893510Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.8895464Z 2025-05-07T20:32:36.8895583Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.8895804Z 2025-05-07T20:32:36.8895910Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.8896333Z self=, 2025-05-07T20:32:36.8896735Z T=2048, 2025-05-07T20:32:36.8896927Z D=5120, 2025-05-07T20:32:36.8897129Z scale_ub=1200.0, 2025-05-07T20:32:36.8897364Z contiguous=False, 2025-05-07T20:32:36.8897590Z compiled=False, 2025-05-07T20:32:36.8897805Z ) 2025-05-07T20:32:36.8898226Z self = 2025-05-07T20:32:36.8898729Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:36.8899013Z 2025-05-07T20:32:36.8899092Z @given( 2025-05-07T20:32:36.8899330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.8899646Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.8899959Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.8900296Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.8900627Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.8900923Z ) 2025-05-07T20:32:36.8901283Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.8901730Z def test_silu_mul_quant( 2025-05-07T20:32:36.8902016Z self, 2025-05-07T20:32:36.8902216Z T: int, 2025-05-07T20:32:36.8902419Z D: int, 2025-05-07T20:32:36.8902636Z scale_ub: Optional[float], 2025-05-07T20:32:36.8902916Z contiguous: bool, 2025-05-07T20:32:36.8903154Z compiled: bool, 2025-05-07T20:32:36.8903381Z ) -> None: 2025-05-07T20:32:36.8903603Z torch.manual_seed(2025) 2025-05-07T20:32:36.8903843Z 2025-05-07T20:32:36.8904123Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.8906209Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.8908102Z 2025-05-07T20:32:36.8908241Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.8908493Z 2025-05-07T20:32:36.8908650Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.8909068Z self=, 2025-05-07T20:32:36.8909475Z T=4096, 2025-05-07T20:32:36.8909671Z D=7168, 2025-05-07T20:32:36.8909899Z scale_ub=1200.0, 2025-05-07T20:32:36.8910126Z contiguous=True, 2025-05-07T20:32:36.8910354Z compiled=False, 2025-05-07T20:32:36.8910560Z ) 2025-05-07T20:32:37.0183186Z self = 2025-05-07T20:32:37.0183970Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:37.0184372Z 2025-05-07T20:32:37.0184484Z @given( 2025-05-07T20:32:37.0184834Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0185252Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0185571Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0186188Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0186534Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0186822Z ) 2025-05-07T20:32:37.0187185Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0187639Z def test_silu_mul_quant( 2025-05-07T20:32:37.0187888Z self, 2025-05-07T20:32:37.0188094Z T: int, 2025-05-07T20:32:37.0188301Z D: int, 2025-05-07T20:32:37.0188526Z scale_ub: Optional[float], 2025-05-07T20:32:37.0188807Z contiguous: bool, 2025-05-07T20:32:37.0189059Z compiled: bool, 2025-05-07T20:32:37.0189291Z ) -> None: 2025-05-07T20:32:37.0189519Z torch.manual_seed(2025) 2025-05-07T20:32:37.0189772Z 2025-05-07T20:32:37.0190053Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0192180Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.0194119Z 2025-05-07T20:32:37.0194242Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.0194464Z 2025-05-07T20:32:37.0194572Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0195001Z self=, 2025-05-07T20:32:37.0195491Z T=16384, 2025-05-07T20:32:37.0195696Z D=7168, 2025-05-07T20:32:37.0195896Z scale_ub=None, 2025-05-07T20:32:37.0196116Z contiguous=False, 2025-05-07T20:32:37.0196351Z compiled=True, 2025-05-07T20:32:37.0196569Z ) 2025-05-07T20:32:37.0196895Z self = 2025-05-07T20:32:37.0197402Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:37.0197689Z 2025-05-07T20:32:37.0197769Z @given( 2025-05-07T20:32:37.0198011Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0198353Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0198690Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0199027Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0199359Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0199650Z ) 2025-05-07T20:32:37.0200013Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0200459Z def test_silu_mul_quant( 2025-05-07T20:32:37.0200709Z self, 2025-05-07T20:32:37.0200912Z T: int, 2025-05-07T20:32:37.0201121Z D: int, 2025-05-07T20:32:37.0201341Z scale_ub: Optional[float], 2025-05-07T20:32:37.0201743Z contiguous: bool, 2025-05-07T20:32:37.0201998Z compiled: bool, 2025-05-07T20:32:37.0202224Z ) -> None: 2025-05-07T20:32:37.0202450Z torch.manual_seed(2025) 2025-05-07T20:32:37.0202705Z 2025-05-07T20:32:37.0203059Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0205159Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.0207117Z 2025-05-07T20:32:37.0207243Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.0207466Z 2025-05-07T20:32:37.0207574Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0207999Z self=, 2025-05-07T20:32:37.0208407Z T=4096, 2025-05-07T20:32:37.0208605Z D=7168, 2025-05-07T20:32:37.0208811Z scale_ub=None, 2025-05-07T20:32:37.0209056Z contiguous=True, 2025-05-07T20:32:37.0209313Z compiled=False, 2025-05-07T20:32:37.0209527Z ) 2025-05-07T20:32:37.0209853Z self = 2025-05-07T20:32:37.0210359Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:37.0210636Z 2025-05-07T20:32:37.0210727Z @given( 2025-05-07T20:32:37.0210961Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0211283Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0211599Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0211938Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0212272Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0212566Z ) 2025-05-07T20:32:37.0212923Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0213379Z def test_silu_mul_quant( 2025-05-07T20:32:37.0213627Z self, 2025-05-07T20:32:37.0213828Z T: int, 2025-05-07T20:32:37.0214035Z D: int, 2025-05-07T20:32:37.0214259Z scale_ub: Optional[float], 2025-05-07T20:32:37.0214543Z contiguous: bool, 2025-05-07T20:32:37.0214785Z compiled: bool, 2025-05-07T20:32:37.0215015Z ) -> None: 2025-05-07T20:32:37.0215291Z torch.manual_seed(2025) 2025-05-07T20:32:37.0215541Z 2025-05-07T20:32:37.0215819Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0217914Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.0219990Z 2025-05-07T20:32:37.0220112Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.0220327Z 2025-05-07T20:32:37.0220441Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0220864Z self=, 2025-05-07T20:32:37.0221276Z T=16384, 2025-05-07T20:32:37.0221475Z D=7168, 2025-05-07T20:32:37.0221670Z scale_ub=None, 2025-05-07T20:32:37.0221893Z contiguous=True, 2025-05-07T20:32:37.0222128Z compiled=False, 2025-05-07T20:32:37.0222337Z ) 2025-05-07T20:32:37.0222715Z self = 2025-05-07T20:32:37.0223223Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:37.0223572Z 2025-05-07T20:32:37.0223663Z @given( 2025-05-07T20:32:37.0223897Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0224219Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0224534Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0224865Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0225202Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0225497Z ) 2025-05-07T20:32:37.0225853Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0226304Z def test_silu_mul_quant( 2025-05-07T20:32:37.0226601Z self, 2025-05-07T20:32:37.0226802Z T: int, 2025-05-07T20:32:37.0227000Z D: int, 2025-05-07T20:32:37.0227228Z scale_ub: Optional[float], 2025-05-07T20:32:37.0227507Z contiguous: bool, 2025-05-07T20:32:37.0227750Z compiled: bool, 2025-05-07T20:32:37.0227979Z ) -> None: 2025-05-07T20:32:37.0228211Z torch.manual_seed(2025) 2025-05-07T20:32:37.0228455Z 2025-05-07T20:32:37.0228731Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0230841Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.0232750Z 2025-05-07T20:32:37.0232871Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.0233088Z 2025-05-07T20:32:37.0233205Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0233620Z self=, 2025-05-07T20:32:37.0234029Z T=16384, 2025-05-07T20:32:37.0234228Z D=7168, 2025-05-07T20:32:37.0234445Z scale_ub=1200.0, 2025-05-07T20:32:37.0234668Z contiguous=True, 2025-05-07T20:32:37.0234897Z compiled=False, 2025-05-07T20:32:37.0235108Z ) 2025-05-07T20:32:37.0235427Z self = 2025-05-07T20:32:37.0235984Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:37.0236266Z 2025-05-07T20:32:37.0236352Z @given( 2025-05-07T20:32:37.0236587Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0236905Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0237219Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0237754Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0238086Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0238381Z ) 2025-05-07T20:32:37.0238734Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0239173Z def test_silu_mul_quant( 2025-05-07T20:32:37.0239417Z self, 2025-05-07T20:32:37.0239615Z T: int, 2025-05-07T20:32:37.0239809Z D: int, 2025-05-07T20:32:37.0240033Z scale_ub: Optional[float], 2025-05-07T20:32:37.0240307Z contiguous: bool, 2025-05-07T20:32:37.0240546Z compiled: bool, 2025-05-07T20:32:37.0240780Z ) -> None: 2025-05-07T20:32:37.0241007Z torch.manual_seed(2025) 2025-05-07T20:32:37.0241253Z 2025-05-07T20:32:37.0241526Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0243666Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.0245604Z 2025-05-07T20:32:37.0245724Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.0245937Z 2025-05-07T20:32:37.0246055Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0246473Z self=, 2025-05-07T20:32:37.0246879Z T=128, 2025-05-07T20:32:37.0247115Z D=5120, 2025-05-07T20:32:37.0247313Z scale_ub=1200.0, 2025-05-07T20:32:37.0247540Z contiguous=False, 2025-05-07T20:32:37.0247772Z compiled=False, 2025-05-07T20:32:37.0247983Z ) 2025-05-07T20:32:37.1666287Z self = 2025-05-07T20:32:37.1666927Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:37.1667231Z 2025-05-07T20:32:37.1667317Z @given( 2025-05-07T20:32:37.1667551Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1667871Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1668187Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1668567Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1668912Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1669202Z ) 2025-05-07T20:32:37.1669563Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1670016Z def test_silu_mul_quant( 2025-05-07T20:32:37.1670263Z self, 2025-05-07T20:32:37.1670487Z T: int, 2025-05-07T20:32:37.1670688Z D: int, 2025-05-07T20:32:37.1670907Z scale_ub: Optional[float], 2025-05-07T20:32:37.1671185Z contiguous: bool, 2025-05-07T20:32:37.1671431Z compiled: bool, 2025-05-07T20:32:37.1671658Z ) -> None: 2025-05-07T20:32:37.1671882Z torch.manual_seed(2025) 2025-05-07T20:32:37.1672128Z 2025-05-07T20:32:37.1672400Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1672749Z 2025-05-07T20:32:37.1672950Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1673243Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1673557Z x = x_sign * x_clamp 2025-05-07T20:32:37.1674079Z x0 = x[:, :D] 2025-05-07T20:32:37.1674322Z x1 = x[:, D:] 2025-05-07T20:32:37.1674540Z 2025-05-07T20:32:37.1674746Z if contiguous: 2025-05-07T20:32:37.1674997Z x0 = x0.contiguous() 2025-05-07T20:32:37.1675279Z x1 = x1.contiguous() 2025-05-07T20:32:37.1675544Z 2025-05-07T20:32:37.1675753Z if scale_ub is not None: 2025-05-07T20:32:37.1676050Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1676426Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1676777Z ) 2025-05-07T20:32:37.1676975Z else: 2025-05-07T20:32:37.1677201Z scale_ub_tensor = None 2025-05-07T20:32:37.1677479Z 2025-05-07T20:32:37.1677726Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1678080Z op = silu_mul_quant 2025-05-07T20:32:37.1678356Z if compiled: 2025-05-07T20:32:37.1678699Z op = torch.compile(op) 2025-05-07T20:32:37.1679093Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1679431Z 2025-05-07T20:32:37.1679631Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1679801Z 2025-05-07T20:32:37.1679902Z moe/activation_test.py:117: 2025-05-07T20:32:37.1680290Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1680672Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1680979Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1681897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1682742Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1683385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1684205Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1685014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1685658Z kernel = self.compile( 2025-05-07T20:32:37.1686380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1687179Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1687646Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1687918Z 2025-05-07T20:32:37.1688164Z self = 2025-05-07T20:32:37.1689561Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1691327Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7acabdea0>} 2025-05-07T20:32:37.1693035Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1694318Z context = 2025-05-07T20:32:37.1694663Z 2025-05-07T20:32:37.1694859Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1695477Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1696038Z module_map=module_map) 2025-05-07T20:32:37.1696453Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1696850Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1697142Z E ^ 2025-05-07T20:32:37.1697742Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1698359Z 2025-05-07T20:32:37.1698788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1699312Z 2025-05-07T20:32:37.1699420Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1699841Z self=, 2025-05-07T20:32:37.1700249Z T=2048, 2025-05-07T20:32:37.1700441Z D=7168, 2025-05-07T20:32:37.1700638Z scale_ub=None, 2025-05-07T20:32:37.1700859Z contiguous=False, 2025-05-07T20:32:37.1701086Z compiled=False, 2025-05-07T20:32:37.1701302Z ) 2025-05-07T20:32:37.1701628Z self = 2025-05-07T20:32:37.1702133Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:37.1702407Z 2025-05-07T20:32:37.1702487Z @given( 2025-05-07T20:32:37.1702728Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1703047Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1703359Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1703698Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1704081Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1704367Z ) 2025-05-07T20:32:37.1704720Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1705204Z def test_silu_mul_quant( 2025-05-07T20:32:37.1705444Z self, 2025-05-07T20:32:37.1705641Z T: int, 2025-05-07T20:32:37.1705842Z D: int, 2025-05-07T20:32:37.1706067Z scale_ub: Optional[float], 2025-05-07T20:32:37.1706338Z contiguous: bool, 2025-05-07T20:32:37.1706582Z compiled: bool, 2025-05-07T20:32:37.1706808Z ) -> None: 2025-05-07T20:32:37.1707023Z torch.manual_seed(2025) 2025-05-07T20:32:37.1707271Z 2025-05-07T20:32:37.1707551Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1709658Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.1711620Z 2025-05-07T20:32:37.1711743Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.1711964Z 2025-05-07T20:32:37.1712069Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1712493Z self=, 2025-05-07T20:32:37.1712903Z T=128, 2025-05-07T20:32:37.1713090Z D=7168, 2025-05-07T20:32:37.1713286Z scale_ub=1200.0, 2025-05-07T20:32:37.1713513Z contiguous=True, 2025-05-07T20:32:37.1713739Z compiled=True, 2025-05-07T20:32:37.1713946Z ) 2025-05-07T20:32:37.2133967Z self = 2025-05-07T20:32:37.2134523Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:37.2134795Z 2025-05-07T20:32:37.2134883Z @given( 2025-05-07T20:32:37.2135120Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.2135444Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.2135758Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.2136087Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.2136424Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.2136713Z ) 2025-05-07T20:32:37.2137255Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.2137702Z def test_silu_mul_quant( 2025-05-07T20:32:37.2137950Z self, 2025-05-07T20:32:37.2138269Z T: int, 2025-05-07T20:32:37.2138466Z D: int, 2025-05-07T20:32:37.2138693Z scale_ub: Optional[float], 2025-05-07T20:32:37.2138974Z contiguous: bool, 2025-05-07T20:32:37.2139214Z compiled: bool, 2025-05-07T20:32:37.2139445Z ) -> None: 2025-05-07T20:32:37.2139668Z torch.manual_seed(2025) 2025-05-07T20:32:37.2139914Z 2025-05-07T20:32:37.2140194Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.2140544Z 2025-05-07T20:32:37.2140739Z x_sign = torch.sign(x) 2025-05-07T20:32:37.2141038Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.2141352Z x = x_sign * x_clamp 2025-05-07T20:32:37.2141592Z x0 = x[:, :D] 2025-05-07T20:32:37.2141815Z x1 = x[:, D:] 2025-05-07T20:32:37.2142033Z 2025-05-07T20:32:37.2142225Z if contiguous: 2025-05-07T20:32:37.2142460Z x0 = x0.contiguous() 2025-05-07T20:32:37.2142725Z x1 = x1.contiguous() 2025-05-07T20:32:37.2142972Z 2025-05-07T20:32:37.2143164Z if scale_ub is not None: 2025-05-07T20:32:37.2143581Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.2152154Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.2152471Z ) 2025-05-07T20:32:37.2152676Z else: 2025-05-07T20:32:37.2153037Z scale_ub_tensor = None 2025-05-07T20:32:37.2153287Z 2025-05-07T20:32:37.2153530Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.2153850Z op = silu_mul_quant 2025-05-07T20:32:37.2154105Z if compiled: 2025-05-07T20:32:37.2154352Z op = torch.compile(op) 2025-05-07T20:32:37.2154652Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.2154928Z 2025-05-07T20:32:37.2155122Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.2155296Z 2025-05-07T20:32:37.2155396Z moe/activation_test.py:117: 2025-05-07T20:32:37.2156097Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.2156547Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.2156845Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.2157423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.2158005Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.2158726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.2159435Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.2159986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.2160678Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.2161355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.2161902Z kernel = self.compile( 2025-05-07T20:32:37.2162459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.2163124Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.2163526Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.2163755Z 2025-05-07T20:32:37.2163976Z self = 2025-05-07T20:32:37.2165083Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.2166578Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7acabf7f0>} 2025-05-07T20:32:37.2167958Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.2169054Z context = 2025-05-07T20:32:37.2169350Z 2025-05-07T20:32:37.2169527Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.2170052Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.2170533Z module_map=module_map) 2025-05-07T20:32:37.2170910Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.2171275Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.2171542Z E ^ 2025-05-07T20:32:37.2172019Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.2172479Z 2025-05-07T20:32:37.2172981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.2173501Z 2025-05-07T20:32:37.2173608Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.2174031Z self=, 2025-05-07T20:32:37.2174497Z T=128, 2025-05-07T20:32:37.2174691Z D=7168, 2025-05-07T20:32:37.2174888Z scale_ub=1200.0, 2025-05-07T20:32:37.2175116Z contiguous=True, 2025-05-07T20:32:37.2175345Z compiled=False, 2025-05-07T20:32:37.2175549Z ) 2025-05-07T20:32:37.2175878Z self = 2025-05-07T20:32:37.2176378Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:37.2176653Z 2025-05-07T20:32:37.2176730Z @given( 2025-05-07T20:32:37.2176966Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.2177331Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.2177638Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.2178148Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.2178539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.2178830Z ) 2025-05-07T20:32:37.2179187Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.2179636Z def test_silu_mul_quant( 2025-05-07T20:32:37.2179874Z self, 2025-05-07T20:32:37.2180075Z T: int, 2025-05-07T20:32:37.2180273Z D: int, 2025-05-07T20:32:37.2180487Z scale_ub: Optional[float], 2025-05-07T20:32:37.2180764Z contiguous: bool, 2025-05-07T20:32:37.2181007Z compiled: bool, 2025-05-07T20:32:37.2181227Z ) -> None: 2025-05-07T20:32:37.2181450Z torch.manual_seed(2025) 2025-05-07T20:32:37.2181695Z 2025-05-07T20:32:37.2181969Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.2182316Z 2025-05-07T20:32:37.2182513Z x_sign = torch.sign(x) 2025-05-07T20:32:37.2182805Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.2184853Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.2186755Z 2025-05-07T20:32:37.2186943Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:37.2187167Z 2025-05-07T20:32:37.2187274Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.2187699Z self=, 2025-05-07T20:32:37.2188102Z T=128, 2025-05-07T20:32:37.2188299Z D=5120, 2025-05-07T20:32:37.2188498Z scale_ub=1200.0, 2025-05-07T20:32:37.2188749Z contiguous=True, 2025-05-07T20:32:37.2189074Z compiled=True, 2025-05-07T20:32:37.2189357Z ) 2025-05-07T20:32:37.2189793Z self = 2025-05-07T20:32:37.2190400Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:37.2190677Z 2025-05-07T20:32:37.2190754Z @given( 2025-05-07T20:32:37.2190986Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.2191294Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.2191603Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.2191944Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.2192273Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.2192564Z ) 2025-05-07T20:32:37.2192918Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.2193436Z def test_silu_mul_quant( 2025-05-07T20:32:37.2193676Z self, 2025-05-07T20:32:37.2193870Z T: int, 2025-05-07T20:32:37.2194066Z D: int, 2025-05-07T20:32:37.2194279Z scale_ub: Optional[float], 2025-05-07T20:32:37.2194599Z contiguous: bool, 2025-05-07T20:32:37.2194842Z compiled: bool, 2025-05-07T20:32:37.2195062Z ) -> None: 2025-05-07T20:32:37.2195280Z torch.manual_seed(2025) 2025-05-07T20:32:37.2195522Z 2025-05-07T20:32:37.2195794Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.2196135Z 2025-05-07T20:32:37.2196329Z x_sign = torch.sign(x) 2025-05-07T20:32:37.2196624Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.2198670Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.2200616Z 2025-05-07T20:32:37.2200736Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:37.2200955Z 2025-05-07T20:32:37.2201060Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.2201479Z self=, 2025-05-07T20:32:37.2201879Z T=128, 2025-05-07T20:32:37.2202073Z D=7168, 2025-05-07T20:32:37.2202267Z scale_ub=None, 2025-05-07T20:32:37.2202476Z contiguous=True, 2025-05-07T20:32:37.2202699Z compiled=True, 2025-05-07T20:32:37.2202906Z ) 2025-05-07T20:32:37.4175937Z self = 2025-05-07T20:32:37.4176499Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:37.4176777Z 2025-05-07T20:32:37.4176861Z @given( 2025-05-07T20:32:37.4177109Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.4177433Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.4177750Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.4178205Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.4178591Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.4178887Z ) 2025-05-07T20:32:37.4179250Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.4179979Z def test_silu_mul_quant( 2025-05-07T20:32:37.4180238Z self, 2025-05-07T20:32:37.4180447Z T: int, 2025-05-07T20:32:37.4180653Z D: int, 2025-05-07T20:32:37.4180887Z scale_ub: Optional[float], 2025-05-07T20:32:37.4181174Z contiguous: bool, 2025-05-07T20:32:37.4181422Z compiled: bool, 2025-05-07T20:32:37.4181652Z ) -> None: 2025-05-07T20:32:37.4181880Z torch.manual_seed(2025) 2025-05-07T20:32:37.4182132Z 2025-05-07T20:32:37.4182411Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.4184542Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.4186477Z 2025-05-07T20:32:37.4186601Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.4186902Z 2025-05-07T20:32:37.4196674Z FAILED 2025-05-07T20:32:37.4196966Z 2025-05-07T20:32:37.4197481Z =================================== FAILURES =================================== 2025-05-07T20:32:37.4198149Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:37.4199103Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:37.4199994Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:32:37.4200762Z | yield 2025-05-07T20:32:37.4201396Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 591, in run 2025-05-07T20:32:37.4202134Z | self._callTestMethod(testMethod) 2025-05-07T20:32:37.4202957Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 549, in _callTestMethod 2025-05-07T20:32:37.4203842Z | method() 2025-05-07T20:32:37.4204750Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:37.4205704Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.4206364Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:37.4207167Z | raise the_error_hypothesis_found 2025-05-07T20:32:37.4207671Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:37.4208172Z +-+---------------- 1 ---------------- 2025-05-07T20:32:37.4208612Z | Traceback (most recent call last): 2025-05-07T20:32:37.4209618Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:37.4210709Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.4213629Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.4216465Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:37.4217172Z | self=, 2025-05-07T20:32:37.4217759Z | T=2048, 2025-05-07T20:32:37.4218210Z | D=5120, # or any other generated value 2025-05-07T20:32:37.4218749Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:37.4219276Z | contiguous=True, # or any other generated value 2025-05-07T20:32:37.4219794Z | compiled=False, # or any other generated value 2025-05-07T20:32:37.4220256Z | ) 2025-05-07T20:32:37.4220515Z | 2025-05-07T20:32:37.4221254Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:37.4222097Z +---------------- 2 ---------------- 2025-05-07T20:32:37.4222516Z | Traceback (most recent call last): 2025-05-07T20:32:37.4223549Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:37.4224647Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.4227582Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.4230505Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:37.4231129Z | self=, 2025-05-07T20:32:37.4231692Z | T=128, 2025-05-07T20:32:37.4231978Z | D=7168, 2025-05-07T20:32:37.4232270Z | scale_ub=None, 2025-05-07T20:32:37.4232623Z | contiguous=True, 2025-05-07T20:32:37.4232965Z | compiled=True, 2025-05-07T20:32:37.4233279Z | ) 2025-05-07T20:32:37.4233640Z | 2025-05-07T20:32:37.4234390Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:37.4235226Z +---------------- 3 ---------------- 2025-05-07T20:32:37.4235637Z | Traceback (most recent call last): 2025-05-07T20:32:37.4236692Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:37.4237786Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.4240405Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.4242431Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:37.4242877Z | self=, 2025-05-07T20:32:37.4243341Z | T=128, 2025-05-07T20:32:37.4243618Z | D=5120, 2025-05-07T20:32:37.4243925Z | scale_ub=1200.0, 2025-05-07T20:32:37.4244282Z | contiguous=True, 2025-05-07T20:32:37.4244646Z | compiled=True, 2025-05-07T20:32:37.4244972Z | ) 2025-05-07T20:32:37.4245236Z | 2025-05-07T20:32:37.4246092Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:37.4246995Z +---------------- 4 ---------------- 2025-05-07T20:32:37.4247434Z | Traceback (most recent call last): 2025-05-07T20:32:37.4248556Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:37.4249638Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:37.4250317Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:37.4251026Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.4251877Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:37.4252684Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:37.4253333Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:37.4254407Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.4255979Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:37.4257329Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.4258790Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:37.4261009Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.4262188Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:37.4263224Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:37.4264200Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:37.4265251Z | fn() 2025-05-07T20:32:37.4266082Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:37.4266995Z | self.fn.run( 2025-05-07T20:32:37.4267757Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:37.4268607Z | kernel = self.compile( 2025-05-07T20:32:37.4269486Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:37.4270512Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.4271545Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:37.4272701Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.4273465Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.4273980Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:37.4274364Z | ^ 2025-05-07T20:32:37.4275036Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.4275872Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:37.4276487Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:37.4277264Z | self=, 2025-05-07T20:32:37.4277914Z | T=1, # or any other generated value 2025-05-07T20:32:37.4278529Z | D=5120, # or any other generated value 2025-05-07T20:32:37.4279037Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:37.4279562Z | contiguous=True, # or any other generated value 2025-05-07T20:32:37.4280100Z | compiled=True, # or any other generated value 2025-05-07T20:32:37.4280545Z | ) 2025-05-07T20:32:37.4280809Z | 2025-05-07T20:32:37.4281551Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:37.4282444Z +------------------------------------ 2025-05-07T20:32:37.4282964Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:37.4283498Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.4284106Z self=, 2025-05-07T20:32:37.4284689Z T=1, 2025-05-07T20:32:37.4284970Z D=5120, 2025-05-07T20:32:37.4285259Z scale_ub=None, 2025-05-07T20:32:37.4285583Z contiguous=True, 2025-05-07T20:32:37.4285903Z compiled=True, 2025-05-07T20:32:37.4286213Z ) 2025-05-07T20:32:37.4286681Z self = 2025-05-07T20:32:37.4287452Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:37.4287849Z 2025-05-07T20:32:37.4287966Z @given( 2025-05-07T20:32:37.4288317Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.4288808Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.4289316Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.4289813Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.4290304Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.4290720Z ) 2025-05-07T20:32:37.4291238Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.4291893Z def test_silu_mul_quant( 2025-05-07T20:32:37.4292258Z self, 2025-05-07T20:32:37.4292539Z T: int, 2025-05-07T20:32:37.4292836Z D: int, 2025-05-07T20:32:37.4293160Z scale_ub: Optional[float], 2025-05-07T20:32:37.4293619Z contiguous: bool, 2025-05-07T20:32:37.4293979Z compiled: bool, 2025-05-07T20:32:37.4294323Z ) -> None: 2025-05-07T20:32:37.4294638Z torch.manual_seed(2025) 2025-05-07T20:32:37.4295000Z 2025-05-07T20:32:37.4295402Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.4295878Z 2025-05-07T20:32:37.4296146Z x_sign = torch.sign(x) 2025-05-07T20:32:37.4296573Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.4297015Z x = x_sign * x_clamp 2025-05-07T20:32:37.4297357Z x0 = x[:, :D] 2025-05-07T20:32:37.4297672Z x1 = x[:, D:] 2025-05-07T20:32:37.4297972Z 2025-05-07T20:32:37.4298363Z if contiguous: 2025-05-07T20:32:37.4298706Z x0 = x0.contiguous() 
2025-05-07T20:32:37.4299084Z x1 = x1.contiguous() 2025-05-07T20:32:37.4299440Z 2025-05-07T20:32:37.4299729Z if scale_ub is not None: 2025-05-07T20:32:37.4300156Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.4300653Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.4301123Z ) 2025-05-07T20:32:37.4301414Z else: 2025-05-07T20:32:37.4301725Z scale_ub_tensor = None 2025-05-07T20:32:37.4302097Z 2025-05-07T20:32:37.4302437Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.4302894Z op = silu_mul_quant 2025-05-07T20:32:37.4303261Z if compiled: 2025-05-07T20:32:37.4303622Z op = torch.compile(op) 2025-05-07T20:32:37.4304044Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.4304443Z 2025-05-07T20:32:37.4304730Z y_fp8, y_scale = fn() 2025-05-07T20:32:37.4305147Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:37.4305620Z 2025-05-07T20:32:37.4305944Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.4306392Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:37.4306797Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:37.4307235Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:37.4307729Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.4308152Z 2025-05-07T20:32:37.4308463Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:37.4308766Z 2025-05-07T20:32:37.4308914Z moe/activation_test.py:126: 2025-05-07T20:32:37.4309315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4309790Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:37.4310290Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.4331903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:37.4332982Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:37.4333781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.4334864Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.4335879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:37.4336985Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.4338206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:37.4339310Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.4340352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:37.4341266Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:37.4342093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:37.4342864Z fn() 2025-05-07T20:32:37.4343583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:37.4344394Z self.fn.run( 2025-05-07T20:32:37.4345071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.4345852Z kernel = self.compile( 2025-05-07T20:32:37.4346637Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.4347585Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.4348161Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4348494Z 2025-05-07T20:32:37.4348789Z self = 2025-05-07T20:32:37.4350334Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.4352316Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa89ab60af0>} 2025-05-07T20:32:37.4354233Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.4356053Z context = 2025-05-07T20:32:37.4356475Z 2025-05-07T20:32:37.4356840Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.4357612Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.4358285Z module_map=module_map) 2025-05-07T20:32:37.4358789Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.4359285Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:37.4359670Z E ^ 2025-05-07T20:32:37.4360326Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.4360970Z 2025-05-07T20:32:37.4361553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.4362301Z 2025-05-07T20:32:37.4362451Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.4363058Z self=, 2025-05-07T20:32:37.4363637Z T=2048, 2025-05-07T20:32:37.4363906Z D=5120, 2025-05-07T20:32:37.4364189Z scale_ub=1200.0, 2025-05-07T20:32:37.4364513Z contiguous=True, 2025-05-07T20:32:37.4364839Z compiled=False, 2025-05-07T20:32:37.4365142Z ) 2025-05-07T20:32:37.4365674Z self = 2025-05-07T20:32:37.4366353Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:37.4366741Z 2025-05-07T20:32:37.4366847Z @given( 2025-05-07T20:32:37.4367254Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.4367692Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.4368128Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.4368592Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.4369056Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.4369451Z ) 2025-05-07T20:32:37.4369946Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.4370569Z def test_silu_mul_quant( 2025-05-07T20:32:37.4370904Z self, 2025-05-07T20:32:37.4371184Z T: int, 2025-05-07T20:32:37.4371534Z D: int, 2025-05-07T20:32:37.4371822Z scale_ub: Optional[float], 2025-05-07T20:32:37.4372202Z contiguous: bool, 2025-05-07T20:32:37.4372531Z compiled: bool, 2025-05-07T20:32:37.4372844Z ) -> None: 2025-05-07T20:32:37.4373143Z torch.manual_seed(2025) 2025-05-07T20:32:37.4373473Z 2025-05-07T20:32:37.4373867Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.4374336Z 2025-05-07T20:32:37.4374603Z x_sign = torch.sign(x) 2025-05-07T20:32:37.4375001Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.4375437Z x = x_sign * x_clamp 2025-05-07T20:32:37.4375777Z x0 = x[:, :D] 
2025-05-07T20:32:37.4376068Z x1 = x[:, D:] 2025-05-07T20:32:37.4376364Z 2025-05-07T20:32:37.4376644Z if contiguous: 2025-05-07T20:32:37.4376972Z x0 = x0.contiguous() 2025-05-07T20:32:37.4377333Z x1 = x1.contiguous() 2025-05-07T20:32:37.4377679Z 2025-05-07T20:32:37.4377940Z if scale_ub is not None: 2025-05-07T20:32:37.4378451Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.4378987Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.4379419Z ) 2025-05-07T20:32:37.4379696Z else: 2025-05-07T20:32:37.4380001Z scale_ub_tensor = None 2025-05-07T20:32:37.4380361Z 2025-05-07T20:32:37.4380704Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.4381149Z op = silu_mul_quant 2025-05-07T20:32:37.4381505Z if compiled: 2025-05-07T20:32:37.4381858Z op = torch.compile(op) 2025-05-07T20:32:37.4382274Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.4382658Z 2025-05-07T20:32:37.4382980Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.4383214Z 2025-05-07T20:32:37.4383357Z moe/activation_test.py:117: 2025-05-07T20:32:37.4383765Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4384218Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.4384636Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.4385641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.4386624Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.4387378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.4388345Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.4389292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.4390042Z kernel = self.compile( 2025-05-07T20:32:37.4390825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.4391759Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.4392396Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4392718Z 2025-05-07T20:32:37.4393010Z self = 2025-05-07T20:32:37.4394504Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.4396470Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa89aa39990>} 2025-05-07T20:32:37.4398473Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.4400026Z context = 2025-05-07T20:32:37.4400434Z 2025-05-07T20:32:37.4400649Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.4401346Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.4401955Z module_map=module_map) 2025-05-07T20:32:37.4402420Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.4402931Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.4403282Z E ^ 2025-05-07T20:32:37.4403950Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.4405202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:37.4406088Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True): fails at moe/activation_test.py:126 in ref_fn while compiling _kernel_quantize_fp8_row: triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.4448845Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): fails at moe/activation_test.py:117 in fn while compiling _fbgemm_silu_mul_quant: same fp8e4nv CompilationError
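Every example so far dies at the same spot: Triton refuses the fp8e4nv destination dtype while lowering the kernel, before anything runs. fp8e4nv corresponds to the e4m3 format that Triton, to my understanding, only emits on NVIDIA parts of compute capability 8.9 or newer; on older SM 8.x GPUs it offers only fp8e4b15 and fp8e5, which is exactly what the ValueError lists. A minimal sketch of a guard such a test could use to skip cleanly on unsupported hardware, using only stock torch APIs; the helper name and decorator placement are illustrative, not taken from this test:

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # Assumption: Triton's fp8e4nv maps to torch.float8_e4m3fn and requires
    # an NVIDIA GPU of compute capability (8, 9) or newer.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Illustrative placement on a test like the one failing here:
# @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
# def test_silu_mul_quant(self, ...) -> None: ...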
2025-05-07T20:32:37.4480862Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True): fails at moe/activation_test.py:126 in ref_fn while compiling _kernel_quantize_fp8_row: same fp8e4nv CompilationError
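For orientation, the three names in the error are Triton dtype strings, not torch dtypes. The mapping below is my reading of the two naming schemes and should be treated as an assumption; the cast at the end shows that plain PyTorch fp8 casts bypass Triton entirely:

import torch

# Assumed correspondence between Triton fp8 names and torch dtypes:
#   fp8e4nv  -> torch.float8_e4m3fn  (finite-only e4m3; the dtype rejected here)
#   fp8e5    -> torch.float8_e5m2    (listed as supported on this GPU)
#   fp8e4b15 -> e4m3 with exponent bias 15; no stock torch equivalent

t = torch.randn(4, 4, device="cuda")
t_e5m2 = t.to(torch.float8_e5m2)  # a plain cast; no Triton compilation involved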
2025-05-07T20:32:37.4529749Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False): fails at moe/activation_test.py:117 in fn while compiling _fbgemm_silu_mul_quant: same fp8e4nv CompilationError
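The reference path fails the same way because triton_quantize_fp8_row also launches a Triton kernel (_kernel_quantize_fp8_row). A plain-PyTorch stand-in for the rowwise scheme it appears to implement, consistent with the test's dequantization y_fp8.to(torch.float32) * y_scale[:, None]; the epsilon and the exact scale_ub semantics are assumptions, not the FBGEMM implementation:

import torch

def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
    # Rowwise FP8 quantization sketch: scale each row so its max maps to the
    # fp8 max, store fp8 values plus one fp32 dequant scale per row.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    row_max = y.abs().amax(dim=-1, keepdim=True).float()
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    row_max = torch.clamp(row_max, min=1e-12)  # assumed epsilon, avoids a zero scale
    scale = row_max / fp8_max
    y_fp8 = (y.float() / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(-1)  # y ~= y_fp8.to(torch.float32) * scale[:, None]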
2025-05-07T20:32:37.4561585Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False): fails at moe/activation_test.py:117 in fn while compiling _fbgemm_silu_mul_quant: same fp8e4nv CompilationError
2025-05-07T20:32:37.4592646Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True): fails at moe/activation_test.py:126 in ref_fn while compiling _kernel_quantize_fp8_row: same fp8e4nv CompilationError
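The eager math under test is small enough to restate: ref_fn computes SiLU gating in fp32, silu(x0) * x1 with silu(x) = x * sigmoid(x), and torch.nn.functional.silu expresses the same product:

import torch
import torch.nn.functional as F

def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    # Identical to ref_fn's x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32.
    return F.silu(x0.float()) * x1.float()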
2025-05-07T20:32:37.4620782Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False): fails at moe/activation_test.py:117 in fn while compiling _fbgemm_silu_mul_quant: same fp8e4nv CompilationError
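Because the error fires while the kernel is being specialized, it should reproduce with one tiny input and no Hypothesis at all. A repro sketch; the import path is read off the traceback, and whether silu_mul_quant is also re-exported at a shorter path is not visible from this log:

import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

D = 5120
x = torch.randn([1, 2 * D], device="cuda", dtype=torch.bfloat16)
x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()
y_fp8, y_scale = silu_mul_quant(x0, x1, None)  # raises CompilationError on SM < 8.9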
2025-05-07T20:32:37.4633575Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False): fails at moe/activation_test.py:117 in fn while compiling _fbgemm_silu_mul_quant: same fp8e4nv CompilationError
2025-05-07T20:32:37.4646565Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True): fails at moe/activation_test.py:126 in ref_fn while compiling _kernel_quantize_fp8_row: same fp8e4nv CompilationError
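Two patterns are visible across the draws summarized above. The failure site tracks the compiled flag: with compiled=False the test dies at line 117 inside fn(), while with compiled=True fn() gets through and the uncompiled reference path at line 126 fails instead. And since @given samples from a fixed grid, every retry re-proves the same architecture limitation on an equivalent point:

from itertools import product

# The fixed search space behind @given; every draw in this log is one of these.
grid = list(product(
    [1, 128, 2048, 4096, 16384],  # T
    [5120, 7168],                 # D
    [None, 1200.00],              # scale_ub
    [True, False],                # contiguous
    [True, False],                # compiled
))
assert len(grid) == 80  # Hypothesis draws at most _MAX_SAMPLES of them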
2025-05-07T20:32:37.4676093Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True): fails at moe/activation_test.py:126 in ref_fn while compiling _kernel_quantize_fp8_row: same fp8e4nv CompilationError
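Had compilation succeeded, the comparison would be between dequantized fp32 values; the source shows the dequant for y, and symmetric treatment of the reference pair is the natural reading. A sketch of that final check with stand-in tensors; the tolerances are placeholders, since the actual assertion sits past line 126 and is not visible in this log:

import torch

# Stand-in tensors; in the test these come from fn() and ref_fn().
y_fp8 = torch.randn(4, 8).to(torch.float8_e4m3fn)
y_scale = torch.ones(4)
y_fp8_ref, y_scale_ref = y_fp8.clone(), y_scale.clone()

y = y_fp8.to(torch.float32) * y_scale[:, None]
y_ref = y_fp8_ref.to(torch.float32) * y_scale_ref[:, None]
torch.testing.assert_close(y, y_ref, atol=1e-1, rtol=1e-1)  # placeholder tolerances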
2025-05-07T20:32:37.4693403Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.4693625Z self=,
2025-05-07T20:32:37.4693708Z T=128,
2025-05-07T20:32:37.4693785Z D=5120,
2025-05-07T20:32:37.4693867Z scale_ub=None,
2025-05-07T20:32:37.4693957Z contiguous=True,
2025-05-07T20:32:37.4694040Z compiled=True,
2025-05-07T20:32:37.4694113Z )
2025-05-07T20:32:37.4694338Z self = 
2025-05-07T20:32:37.4694546Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:37.4694632Z @given(
2025-05-07T20:32:37.4694753Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:37.4694855Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:37.4694977Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:37.4695095Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:37.4695210Z compiled=st.sampled_from([True, False]),
2025-05-07T20:32:37.4695289Z )
2025-05-07T20:32:37.4695539Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:37.4695633Z def test_silu_mul_quant(
2025-05-07T20:32:37.4695713Z self,
2025-05-07T20:32:37.4695790Z T: int,
2025-05-07T20:32:37.4695864Z D: int,
2025-05-07T20:32:37.4695969Z scale_ub: Optional[float],
2025-05-07T20:32:37.4696061Z contiguous: bool,
2025-05-07T20:32:37.4696151Z compiled: bool,
2025-05-07T20:32:37.4696232Z ) -> None:
2025-05-07T20:32:37.4696328Z torch.manual_seed(2025)
2025-05-07T20:32:37.4696572Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:37.4696744Z x_sign = torch.sign(x)
2025-05-07T20:32:37.4696912Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:37.4697003Z x = x_sign * x_clamp
2025-05-07T20:32:37.4697087Z x0 = x[:, :D]
2025-05-07T20:32:37.4697167Z x1 = x[:, D:]
2025-05-07T20:32:37.4697367Z if contiguous:
2025-05-07T20:32:37.4697459Z x0 = x0.contiguous()
2025-05-07T20:32:37.4697553Z x1 = x1.contiguous()
2025-05-07T20:32:37.4697719Z if scale_ub is not None:
2025-05-07T20:32:37.4697829Z scale_ub_tensor = torch.tensor(
2025-05-07T20:32:37.4697963Z [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:37.4698118Z )
2025-05-07T20:32:37.4698200Z else:
2025-05-07T20:32:37.4698298Z scale_ub_tensor = None
2025-05-07T20:32:37.4698591Z def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:37.4698693Z             op = silu_mul_quant
2025-05-07T20:32:37.4698777Z             if compiled:
2025-05-07T20:32:37.4698879Z                 op = torch.compile(op)
2025-05-07T20:32:37.4698985Z             return op(x0, x1, scale_ub_tensor)
[remainder of this example's output elided: the failure at `> y_fp8_ref, y_scale_ref = ref_fn()` (moe/activation_test.py:126) and the traceback through triton_quantize_fp8_row (fp8_gemm.py:2370) into _kernel_quantize_fp8_row are identical to the block above, ending in the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:37.4710251Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
[identical failure: ref_fn at moe/activation_test.py:126 -> _kernel_quantize_fp8_row -> same CompilationError]
2025-05-07T20:32:37.4726975Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
[identical failure: ref_fn at moe/activation_test.py:126 -> _kernel_quantize_fp8_row -> same CompilationError]
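[editor's note: every example in this run dies at Triton compile time, before any numerics execute. fp8e4nv is Triton's name for FP8 E4M3, and Triton's CUDA backend only lowers that dtype on GPUs with compute capability >= 8.9 (Ada/Hopper); on older parts it raises exactly this ValueError, listing only ('fp8e4b15', 'fp8e5') as supported. The linux.g5.4xlarge runner carries an NVIDIA A10G, which is SM 8.6, so both _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant are uncompilable on this machine. A capability gate would turn these into skips instead of failures -- a minimal sketch of one possible guard, not the repository's actual fix:]

    import unittest
    import torch

    def device_supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (FP8 E4M3) only on NVIDIA GPUs with
        # compute capability >= 8.9; the A10G on this runner is SM 8.6.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical placement on the test shown above:
    #   @unittest.skipUnless(device_supports_fp8e4nv(), "FP8 E4M3 unsupported on this GPU")
    #   def test_silu_mul_quant(...): ...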
2025-05-07T20:32:37.4743796Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
[this example fails earlier, at `> y_fp8, y_scale = fn()` (moe/activation_test.py:117), through torch/_dynamo/eval_frame.py:678 into silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant -> same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
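[editor's note: for readers skimming the repeated listings, the computation under test is small. A standalone sketch of the reference path exercised by ref_fn above, using the same fbgemm_gpu import the test uses (on this runner the quantize call itself trips the same fp8e4nv error):]

    from typing import Optional, Tuple
    import torch
    from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row

    def silu_mul_ref(
        x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor]
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32, then rowwise FP8 quantization, as in ref_fn.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        return triton_quantize_fp8_row(y, scale_ub)

    # The test then dequantizes both sides for comparison:
    #   y = y_fp8.to(torch.float32) * y_scale[:, None]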
2025-05-07T20:32:37.4757454Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
[identical failure: ref_fn at moe/activation_test.py:126 -> _kernel_quantize_fp8_row -> same CompilationError]
2025-05-07T20:32:37.4774411Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
[identical failure in fn() at moe/activation_test.py:117 -> silu_mul_quant -> _fbgemm_silu_mul_quant -> same CompilationError]
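[editor's note: since both failing kernels are rowwise FP8 quantizers, it may help to spell out the computation they implement. The following eager-mode equivalent is an illustration under stated assumptions -- per-row scale = max(|row|)/FP8_MAX, optionally clamped by scale_ub -- not FBGEMM's actual _kernel_quantize_fp8_row:]

    from typing import Optional, Tuple
    import torch

    FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value

    def quantize_fp8_row_eager(
        x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, chosen so the row's max maps to FP8_E4M3_MAX.
        row_max = x.abs().amax(dim=-1).float()
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=1e-12) / FP8_E4M3_MAX
        x_fp8 = (x.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return x_fp8, scale  # dequantize with x_fp8.float() * scale[:, None]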
2025-05-07T20:32:37.4787160Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
[identical failure in fn() at moe/activation_test.py:117, through torch/_dynamo/eval_frame.py:678 -> silu_mul_quant -> _fbgemm_silu_mul_quant -> same CompilationError]
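[editor's note: to replay one failing case deterministically without the Hypothesis search loop, the drawn parameters can be inlined. A sketch reproducing the T=1, scale_ub=1200.0 example, with the import path taken from the traceback:]

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    torch.manual_seed(2025)
    T, D = 1, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()
    scale_ub = torch.tensor([1200.0], device="cuda", dtype=torch.float32)
    # On SM < 8.9 this raises triton.compiler.errors.CompilationError
    # (fp8e4nv unsupported) at kernel-compile time.
    y_fp8, y_scale = silu_mul_quant(x0, x1, scale_ub)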
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.4803733Z 2025-05-07T20:32:37.4804151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.4804163Z 2025-05-07T20:32:37.4804266Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.4804490Z self=, 2025-05-07T20:32:37.4804570Z T=128, 2025-05-07T20:32:37.4804643Z D=7168, 2025-05-07T20:32:37.4804724Z scale_ub=1200.0, 2025-05-07T20:32:37.4804814Z contiguous=False, 2025-05-07T20:32:37.4804896Z compiled=False, 2025-05-07T20:32:37.4804967Z ) 2025-05-07T20:32:37.4805188Z self = 2025-05-07T20:32:37.4805360Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:37.4805365Z 2025-05-07T20:32:37.4805442Z @given( 2025-05-07T20:32:37.4805601Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.4805700Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.4805817Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.4805937Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.4806054Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.4806129Z ) 2025-05-07T20:32:37.4806375Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.4806468Z def test_silu_mul_quant( 2025-05-07T20:32:37.4806551Z self, 2025-05-07T20:32:37.4806625Z T: int, 2025-05-07T20:32:37.4806699Z D: int, 2025-05-07T20:32:37.4806798Z scale_ub: Optional[float], 2025-05-07T20:32:37.4806884Z contiguous: bool, 2025-05-07T20:32:37.4806968Z compiled: bool, 2025-05-07T20:32:37.4807044Z ) -> None: 2025-05-07T20:32:37.4807136Z torch.manual_seed(2025) 2025-05-07T20:32:37.4807208Z 2025-05-07T20:32:37.4807379Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.4807455Z 2025-05-07T20:32:37.4807549Z x_sign = torch.sign(x) 2025-05-07T20:32:37.4807676Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.4807762Z x = x_sign * x_clamp 2025-05-07T20:32:37.4807889Z x0 = x[:, :D] 2025-05-07T20:32:37.4807967Z x1 = x[:, D:] 2025-05-07T20:32:37.4808037Z 2025-05-07T20:32:37.4808121Z if contiguous: 2025-05-07T20:32:37.4808214Z x0 = x0.contiguous() 2025-05-07T20:32:37.4808350Z x1 = x1.contiguous() 2025-05-07T20:32:37.4808419Z 2025-05-07T20:32:37.4808507Z if scale_ub is not None: 2025-05-07T20:32:37.4808618Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.4808752Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.4808825Z ) 2025-05-07T20:32:37.4808905Z else: 2025-05-07T20:32:37.4808996Z scale_ub_tensor = None 2025-05-07T20:32:37.4809074Z 2025-05-07T20:32:37.4809209Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.4809296Z op = silu_mul_quant 2025-05-07T20:32:37.4809421Z if compiled: 2025-05-07T20:32:37.4809528Z op = torch.compile(op) 2025-05-07T20:32:37.4809634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.4809711Z 2025-05-07T20:32:37.4809799Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.4809803Z 2025-05-07T20:32:37.4809899Z moe/activation_test.py:117: 2025-05-07T20:32:37.4810029Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4810128Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.4810226Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.4810737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.4810832Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.4811201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.4811422Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.4811768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.4811866Z kernel = self.compile( 2025-05-07T20:32:37.4812252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.4812428Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.4812554Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4812558Z 2025-05-07T20:32:37.4812763Z self = 2025-05-07T20:32:37.4813591Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.4814103Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa89806e200>} 2025-05-07T20:32:37.4814863Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.4815062Z context = 2025-05-07T20:32:37.4815067Z 2025-05-07T20:32:37.4815231Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.4815499Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.4815606Z module_map=module_map) 2025-05-07T20:32:37.4815766Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.4815868Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.4815946Z E ^ 2025-05-07T20:32:37.4816350Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.4816355Z 2025-05-07T20:32:37.4816771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.4816839Z 2025-05-07T20:32:37.4816943Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.4817169Z self=, 2025-05-07T20:32:37.4817243Z T=128, 2025-05-07T20:32:37.4817321Z D=5120, 2025-05-07T20:32:37.4817401Z scale_ub=None, 2025-05-07T20:32:37.4817486Z contiguous=False, 2025-05-07T20:32:37.4817568Z compiled=False, 2025-05-07T20:32:37.4817643Z ) 2025-05-07T20:32:37.4817864Z self = 2025-05-07T20:32:37.4818147Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:37.4818200Z 2025-05-07T20:32:37.4818281Z @given( 2025-05-07T20:32:37.4818414Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.4818526Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.4818668Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.4818787Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.4818905Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.4818976Z ) 2025-05-07T20:32:37.4819221Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.4819317Z def test_silu_mul_quant( 2025-05-07T20:32:37.4819390Z self, 2025-05-07T20:32:37.4819464Z T: int, 2025-05-07T20:32:37.4819541Z D: int, 2025-05-07T20:32:37.4819639Z scale_ub: Optional[float], 2025-05-07T20:32:37.4819727Z contiguous: bool, 2025-05-07T20:32:37.4819811Z compiled: bool, 2025-05-07T20:32:37.4819890Z ) -> None: 2025-05-07T20:32:37.4819982Z torch.manual_seed(2025) 2025-05-07T20:32:37.4820055Z 2025-05-07T20:32:37.4820226Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.4820297Z 2025-05-07T20:32:37.4820392Z x_sign = torch.sign(x) 2025-05-07T20:32:37.4820514Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.4820607Z x = x_sign * x_clamp 2025-05-07T20:32:37.4820683Z x0 = x[:, :D] 2025-05-07T20:32:37.4820759Z x1 = x[:, D:] 2025-05-07T20:32:37.4820834Z 2025-05-07T20:32:37.4820914Z if contiguous: 2025-05-07T20:32:37.4821002Z x0 = x0.contiguous() 2025-05-07T20:32:37.4821091Z x1 = x1.contiguous() 2025-05-07T20:32:37.4821163Z 2025-05-07T20:32:37.4821297Z if scale_ub is not None: 2025-05-07T20:32:37.4821405Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.4821539Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.4821614Z ) 2025-05-07T20:32:37.4821692Z else: 2025-05-07T20:32:37.4821787Z scale_ub_tensor = None 2025-05-07T20:32:37.4821859Z 2025-05-07T20:32:37.4821989Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.4822076Z op = silu_mul_quant 2025-05-07T20:32:37.4822162Z if compiled: 2025-05-07T20:32:37.4822259Z op = torch.compile(op) 2025-05-07T20:32:37.4822362Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.4822436Z 2025-05-07T20:32:37.4822524Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.4822528Z 2025-05-07T20:32:37.4822623Z moe/activation_test.py:117: 2025-05-07T20:32:37.4822750Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4822851Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.4822954Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.4823456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.4823598Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.4823963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.4824184Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.4824569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.4824668Z kernel = self.compile( 2025-05-07T20:32:37.4825051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.4825228Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.4825352Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4825357Z 2025-05-07T20:32:37.4825604Z self = 2025-05-07T20:32:37.4826393Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.4826903Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa898626a70>} 2025-05-07T20:32:37.4827665Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.4827859Z context = 2025-05-07T20:32:37.4827863Z 2025-05-07T20:32:37.4828030Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.4828295Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.4828403Z module_map=module_map) 2025-05-07T20:32:37.4828580Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.4828690Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.4828780Z E ^ 2025-05-07T20:32:37.4829150Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
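Every example above fails for the same reason: Triton's fp8e4nv type (FP8 E4M3) is only available on NVIDIA GPUs with compute capability 8.9 or newer, while this job runs on a linux.g5.4xlarge runner (A10G, sm_86), where the backend only offers fp8e4b15 and fp8e5. A minimal sketch of a capability guard that would skip these examples on unsupported hardware is below; it assumes a unittest-style test class like the one in moe/activation_test.py, and the helper and class names are illustrative, not the repository's actual code.

import unittest

import torch


def _supports_fp8e4nv() -> bool:
    # fp8e4nv (E4M3) needs sm_89+ (Ada/Hopper); the A10G on this runner is sm_86.
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)


# Hypothetical guard: skip the FP8 activation tests on pre-sm_89 GPUs.
@unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
class ActivationFP8Tests(unittest.TestCase):
    ...

With such a guard, Hypothesis would never reach the Triton compile step on this runner, and the job would report a skip instead of repeating the same CompilationError for every sampled parameter tuple.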
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.4868603Z 2025-05-07T20:32:37.4869020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.4869033Z 2025-05-07T20:32:37.4869137Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.4869361Z self=, 2025-05-07T20:32:37.4869441Z T=1, 2025-05-07T20:32:37.4869518Z D=7168, 2025-05-07T20:32:37.4869600Z scale_ub=None, 2025-05-07T20:32:37.4869688Z contiguous=False, 2025-05-07T20:32:37.4869769Z compiled=True, 2025-05-07T20:32:37.4869840Z ) 2025-05-07T20:32:37.4870062Z self = 2025-05-07T20:32:37.4870233Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:37.4870238Z 2025-05-07T20:32:37.4870320Z @given( 2025-05-07T20:32:37.4870482Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.4870586Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.4870708Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.4870831Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.4870944Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.4871020Z ) 2025-05-07T20:32:37.4871266Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.4871360Z def test_silu_mul_quant( 2025-05-07T20:32:37.4871438Z self, 2025-05-07T20:32:37.4871512Z T: int, 2025-05-07T20:32:37.4871587Z D: int, 2025-05-07T20:32:37.4871690Z scale_ub: Optional[float], 2025-05-07T20:32:37.4871778Z contiguous: bool, 2025-05-07T20:32:37.4871864Z compiled: bool, 2025-05-07T20:32:37.4871941Z ) -> None: 2025-05-07T20:32:37.4872050Z torch.manual_seed(2025) 2025-05-07T20:32:37.4872161Z 2025-05-07T20:32:37.4872393Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.4872468Z 2025-05-07T20:32:37.4872565Z x_sign = torch.sign(x) 2025-05-07T20:32:37.4872688Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.4872827Z x = x_sign * x_clamp 2025-05-07T20:32:37.4872912Z x0 = x[:, :D] 2025-05-07T20:32:37.4872988Z x1 = x[:, D:] 2025-05-07T20:32:37.4873056Z 2025-05-07T20:32:37.4873140Z if contiguous: 2025-05-07T20:32:37.4873271Z x0 = x0.contiguous() 2025-05-07T20:32:37.4873360Z x1 = x1.contiguous() 2025-05-07T20:32:37.4873428Z 2025-05-07T20:32:37.4873516Z if scale_ub is not None: 2025-05-07T20:32:37.4873625Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.4873761Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.4873835Z ) 2025-05-07T20:32:37.4873911Z else: 2025-05-07T20:32:37.4874006Z scale_ub_tensor = None 2025-05-07T20:32:37.4874076Z 2025-05-07T20:32:37.4874211Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.4874340Z op = silu_mul_quant 2025-05-07T20:32:37.4874421Z if compiled: 2025-05-07T20:32:37.4874524Z op = torch.compile(op) 2025-05-07T20:32:37.4874631Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.4874705Z 2025-05-07T20:32:37.4874794Z y_fp8, y_scale = fn() 2025-05-07T20:32:37.4874918Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:37.4874990Z 2025-05-07T20:32:37.4875127Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.4875227Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:37.4875327Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:37.4875448Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:37.4875590Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.4875668Z 2025-05-07T20:32:37.4875766Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:37.4875771Z 2025-05-07T20:32:37.4875872Z moe/activation_test.py:126: 2025-05-07T20:32:37.4876001Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4876108Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:37.4876247Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.4876812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:37.4876913Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:37.4877280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.4877505Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.4877945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:37.4878202Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.4878610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:37.4878866Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.4879246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:37.4879417Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:37.4879760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:37.4879834Z fn() 2025-05-07T20:32:37.4880240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:37.4880319Z self.fn.run( 2025-05-07T20:32:37.4880660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.4880759Z kernel = self.compile( 2025-05-07T20:32:37.4881183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.4881361Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.4881525Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4881529Z 2025-05-07T20:32:37.4881736Z self = 2025-05-07T20:32:37.4882527Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.4883036Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fa7adcaadd0>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
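Note that in the example just above, the reference path fails too: ref_fn calls triton_quantize_fp8_row from fbgemm_gpu's fp8_gemm module, which launches its own Triton kernel (_kernel_quantize_fp8_row), so both the op under test and the reference hit the identical fp8e4nv compile error. For clarity, here is a minimal eager-mode sketch of the per-row FP8 quantization the reference computes, assuming row-wise E4M3 scaling with a maximum representable magnitude of 448 and an epsilon guard; the function name and these details are assumptions for illustration, and FBGEMM's triton_quantize_fp8_row may differ.

from typing import Optional, Tuple

import torch

FP8_E4M3_MAX = 448.0  # assumption: max magnitude representable in float8_e4m3fn


def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Eager-mode sketch of per-row FP8 quantization (not FBGEMM's kernel)."""
    row_max = y.abs().amax(dim=-1).float()          # per-row max magnitude
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())  # cap the scale
    row_max = torch.clamp(row_max, min=1e-12)       # avoid division by zero
    y_scale = row_max / FP8_E4M3_MAX                # per-row dequantization scale
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale

Dequantization is then y_fp8.to(torch.float32) * y_scale[:, None], which is exactly the comparison the test performs on both outputs.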
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
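As the parameter sweep shows, the failure is independent of T, D, scale_ub, contiguous, and compiled: any Triton kernel that casts to tl.float8e4nv trips the same backend check on this GPU. A hypothetical minimal repro, independent of FBGEMM, could look like the sketch below (the kernel and its names are illustrative only):

import torch
import triton
import triton.language as tl


@triton.jit
def _cast_to_fp8e4nv(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
    # The cast to tl.float8e4nv is what the pre-sm_89 backend rejects.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
# On an A10G (sm_86) this launch raises the same CompilationError seen above.
_cast_to_fp8e4nv[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)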
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.4981452Z 2025-05-07T20:32:37.4981872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.4981879Z 2025-05-07T20:32:37.4981980Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.4982201Z self=, 2025-05-07T20:32:37.4982276Z T=4096, 2025-05-07T20:32:37.4982347Z D=5120, 2025-05-07T20:32:37.4982424Z scale_ub=None, 2025-05-07T20:32:37.4982510Z contiguous=False, 2025-05-07T20:32:37.4982589Z compiled=True, 2025-05-07T20:32:37.4982662Z ) 2025-05-07T20:32:37.4982881Z self = 2025-05-07T20:32:37.4983052Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:37.4983060Z 2025-05-07T20:32:37.4983133Z @given( 2025-05-07T20:32:37.4983292Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.4983389Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.4983508Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.4983625Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.4983775Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.4983850Z ) 2025-05-07T20:32:37.4984096Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.4984191Z def test_silu_mul_quant( 2025-05-07T20:32:37.4984267Z self, 2025-05-07T20:32:37.4984341Z T: int, 2025-05-07T20:32:37.4984414Z D: int, 2025-05-07T20:32:37.4984511Z scale_ub: Optional[float], 2025-05-07T20:32:37.4984602Z contiguous: bool, 2025-05-07T20:32:37.4984685Z compiled: bool, 2025-05-07T20:32:37.4984827Z ) -> None: 2025-05-07T20:32:37.4984922Z torch.manual_seed(2025) 2025-05-07T20:32:37.4984989Z 2025-05-07T20:32:37.4985159Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.4985230Z 2025-05-07T20:32:37.4985318Z x_sign = torch.sign(x) 2025-05-07T20:32:37.4985442Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.4985531Z x = x_sign * x_clamp 2025-05-07T20:32:37.4985606Z x0 = x[:, :D] 2025-05-07T20:32:37.4985682Z x1 = x[:, D:] 2025-05-07T20:32:37.4985753Z 2025-05-07T20:32:37.4985834Z if contiguous: 2025-05-07T20:32:37.4985925Z x0 = x0.contiguous() 2025-05-07T20:32:37.4986010Z x1 = x1.contiguous() 2025-05-07T20:32:37.4986077Z 2025-05-07T20:32:37.4986166Z if scale_ub is not None: 2025-05-07T20:32:37.4986271Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.4986403Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.4986481Z ) 2025-05-07T20:32:37.4986552Z else: 2025-05-07T20:32:37.4986646Z scale_ub_tensor = None 2025-05-07T20:32:37.4986719Z 2025-05-07T20:32:37.4986848Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.4986936Z op = silu_mul_quant 2025-05-07T20:32:37.4987020Z if compiled: 2025-05-07T20:32:37.4987119Z op = torch.compile(op) 2025-05-07T20:32:37.4987225Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.4987292Z 2025-05-07T20:32:37.4987380Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.4987384Z 2025-05-07T20:32:37.4987481Z moe/activation_test.py:117: 2025-05-07T20:32:37.4987604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4987749Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.4987851Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.4988221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.4988315Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.4988818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.4988912Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.4989279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.4989503Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.4989845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.4989940Z kernel = self.compile( 2025-05-07T20:32:37.4990327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.4990502Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.4990626Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4990671Z 2025-05-07T20:32:37.4990878Z self = 2025-05-07T20:32:37.4991665Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.4992214Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad768280>} 2025-05-07T20:32:37.4992973Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.4993165Z context = 2025-05-07T20:32:37.4993208Z 2025-05-07T20:32:37.4993374Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.4993642Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.4993749Z module_map=module_map) 2025-05-07T20:32:37.4993915Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.4994011Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.4994084Z E ^ 2025-05-07T20:32:37.4994442Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.4994446Z 2025-05-07T20:32:37.4994864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.4994868Z 2025-05-07T20:32:37.4994974Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.4995197Z self=, 2025-05-07T20:32:37.4995271Z T=4096, 2025-05-07T20:32:37.4995345Z D=5120, 2025-05-07T20:32:37.4995424Z scale_ub=1200.0, 2025-05-07T20:32:37.4995507Z contiguous=False, 2025-05-07T20:32:37.4995591Z compiled=False, 2025-05-07T20:32:37.4995663Z ) 2025-05-07T20:32:37.4995879Z self = 2025-05-07T20:32:37.4996055Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:37.4996059Z 2025-05-07T20:32:37.4996135Z @given( 2025-05-07T20:32:37.4996253Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.4996350Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.4996506Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.4996625Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.4996735Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.4996807Z ) 2025-05-07T20:32:37.4997058Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.4997151Z def test_silu_mul_quant( 2025-05-07T20:32:37.4997223Z self, 2025-05-07T20:32:37.4997298Z T: int, 2025-05-07T20:32:37.4997369Z D: int, 2025-05-07T20:32:37.4997467Z scale_ub: Optional[float], 2025-05-07T20:32:37.4997555Z contiguous: bool, 2025-05-07T20:32:37.4997637Z compiled: bool, 2025-05-07T20:32:37.4997717Z ) -> None: 2025-05-07T20:32:37.4997808Z torch.manual_seed(2025) 2025-05-07T20:32:37.4997876Z 2025-05-07T20:32:37.4998047Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.4998118Z 2025-05-07T20:32:37.4998207Z x_sign = torch.sign(x) 2025-05-07T20:32:37.4998334Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.4998419Z x = x_sign * x_clamp 2025-05-07T20:32:37.4998499Z x0 = x[:, :D] 2025-05-07T20:32:37.4998576Z x1 = x[:, D:] 2025-05-07T20:32:37.4998643Z 2025-05-07T20:32:37.4998765Z if contiguous: 2025-05-07T20:32:37.4998859Z x0 = x0.contiguous() 2025-05-07T20:32:37.4998945Z x1 = x1.contiguous() 2025-05-07T20:32:37.4999017Z 2025-05-07T20:32:37.4999106Z if scale_ub is not None: 2025-05-07T20:32:37.4999249Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.4999385Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.4999455Z ) 2025-05-07T20:32:37.4999527Z else: 2025-05-07T20:32:37.4999620Z scale_ub_tensor = None 2025-05-07T20:32:37.4999689Z 2025-05-07T20:32:37.4999817Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.4999912Z op = silu_mul_quant 2025-05-07T20:32:37.4999992Z if compiled: 2025-05-07T20:32:37.5000086Z op = torch.compile(op) 2025-05-07T20:32:37.5000238Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5000307Z 2025-05-07T20:32:37.5000400Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.5000405Z 2025-05-07T20:32:37.5000498Z moe/activation_test.py:117: 2025-05-07T20:32:37.5000621Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5000724Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.5000823Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5001327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:37.5001424Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.5001782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.5002010Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.5002353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.5002447Z kernel = self.compile( 2025-05-07T20:32:37.5002835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.5003008Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.5003131Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5003140Z 2025-05-07T20:32:37.5003344Z self = 2025-05-07T20:32:37.5004169Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.5004681Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad769000>} 2025-05-07T20:32:37.5005439Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.5005632Z context = 2025-05-07T20:32:37.5005639Z 2025-05-07T20:32:37.5005803Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.5006067Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.5006174Z module_map=module_map) 2025-05-07T20:32:37.5006333Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.5006430Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.5006507Z E ^ 2025-05-07T20:32:37.5006862Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.5006869Z 2025-05-07T20:32:37.5007330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.5007335Z 2025-05-07T20:32:37.5007437Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5007696Z self=, 2025-05-07T20:32:37.5007771Z T=4096, 2025-05-07T20:32:37.5007844Z D=5120, 2025-05-07T20:32:37.5007926Z scale_ub=1200.0, 2025-05-07T20:32:37.5008009Z contiguous=False, 2025-05-07T20:32:37.5008089Z compiled=True, 2025-05-07T20:32:37.5008159Z ) 2025-05-07T20:32:37.5008378Z self = 2025-05-07T20:32:37.5008554Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:37.5008559Z 2025-05-07T20:32:37.5008633Z @given( 2025-05-07T20:32:37.5008748Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5008888Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5009006Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5009122Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5009235Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5009309Z ) 2025-05-07T20:32:37.5009552Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5009648Z def test_silu_mul_quant( 2025-05-07T20:32:37.5009722Z self, 2025-05-07T20:32:37.5009796Z T: int, 2025-05-07T20:32:37.5009873Z D: int, 2025-05-07T20:32:37.5009969Z scale_ub: Optional[float], 2025-05-07T20:32:37.5010055Z contiguous: bool, 2025-05-07T20:32:37.5010144Z compiled: bool, 2025-05-07T20:32:37.5010218Z ) -> None: 2025-05-07T20:32:37.5010310Z torch.manual_seed(2025) 2025-05-07T20:32:37.5010381Z 2025-05-07T20:32:37.5010549Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5010623Z 2025-05-07T20:32:37.5010715Z x_sign = torch.sign(x) 2025-05-07T20:32:37.5010836Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.5010924Z x = x_sign * x_clamp 2025-05-07T20:32:37.5010999Z x0 = x[:, :D] 2025-05-07T20:32:37.5011076Z x1 = x[:, D:] 2025-05-07T20:32:37.5011146Z 2025-05-07T20:32:37.5011227Z if contiguous: 2025-05-07T20:32:37.5011316Z x0 = x0.contiguous() 2025-05-07T20:32:37.5011403Z x1 = x1.contiguous() 2025-05-07T20:32:37.5011473Z 2025-05-07T20:32:37.5011562Z if scale_ub is not None: 2025-05-07T20:32:37.5011667Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.5011842Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.5011915Z ) 2025-05-07T20:32:37.5011990Z else: 2025-05-07T20:32:37.5012080Z scale_ub_tensor = None 2025-05-07T20:32:37.5012156Z 2025-05-07T20:32:37.5012287Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.5012373Z op = silu_mul_quant 2025-05-07T20:32:37.5012457Z if compiled: 2025-05-07T20:32:37.5012552Z op = torch.compile(op) 2025-05-07T20:32:37.5012655Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5012731Z 2025-05-07T20:32:37.5012818Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.5012822Z 2025-05-07T20:32:37.5012916Z moe/activation_test.py:117: 2025-05-07T20:32:37.5013042Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5013140Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.5013240Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5013611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.5013702Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.5014250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.5014346Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.5014703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.5014992Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.5015330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.5015425Z kernel = self.compile( 2025-05-07T20:32:37.5015809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.5015986Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.5016109Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5016154Z 2025-05-07T20:32:37.5016359Z self = 2025-05-07T20:32:37.5017147Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.5017657Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad768700>} 2025-05-07T20:32:37.5018496Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.5018691Z context = 2025-05-07T20:32:37.5018695Z 2025-05-07T20:32:37.5018858Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.5019129Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.5019233Z module_map=module_map) 2025-05-07T20:32:37.5019393Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.5019496Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.5019569Z E ^ 2025-05-07T20:32:37.5019929Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.5019934Z 2025-05-07T20:32:37.5020350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.5020354Z 2025-05-07T20:32:37.5020499Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5020725Z self=, 2025-05-07T20:32:37.5020800Z T=2048, 2025-05-07T20:32:37.5020873Z D=7168, 2025-05-07T20:32:37.5020959Z scale_ub=1200.0, 2025-05-07T20:32:37.5021046Z contiguous=False, 2025-05-07T20:32:37.5021132Z compiled=False, 2025-05-07T20:32:37.5021200Z ) 2025-05-07T20:32:37.5021416Z self = 2025-05-07T20:32:37.5021597Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:37.5021602Z 2025-05-07T20:32:37.5021674Z @given( 2025-05-07T20:32:37.5021790Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5021889Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5022002Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5022116Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5022234Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5022303Z ) 2025-05-07T20:32:37.5022549Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5022641Z def test_silu_mul_quant( 2025-05-07T20:32:37.5022713Z self, 2025-05-07T20:32:37.5022833Z T: int, 2025-05-07T20:32:37.5022908Z D: int, 2025-05-07T20:32:37.5023003Z scale_ub: Optional[float], 2025-05-07T20:32:37.5023092Z contiguous: bool, 2025-05-07T20:32:37.5023214Z compiled: bool, 2025-05-07T20:32:37.5023288Z ) -> None: 2025-05-07T20:32:37.5023385Z torch.manual_seed(2025) 2025-05-07T20:32:37.5023456Z 2025-05-07T20:32:37.5023622Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5023697Z 2025-05-07T20:32:37.5023785Z x_sign = torch.sign(x) 2025-05-07T20:32:37.5023910Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.5023998Z x = x_sign * x_clamp 2025-05-07T20:32:37.5024073Z x0 = x[:, :D] 2025-05-07T20:32:37.5024151Z x1 = x[:, D:] 2025-05-07T20:32:37.5024221Z 2025-05-07T20:32:37.5024342Z if contiguous: 2025-05-07T20:32:37.5024434Z x0 = x0.contiguous() 2025-05-07T20:32:37.5024521Z x1 = x1.contiguous() 2025-05-07T20:32:37.5024588Z 2025-05-07T20:32:37.5024678Z if scale_ub is not None: 2025-05-07T20:32:37.5024781Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.5024915Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.5024989Z ) 2025-05-07T20:32:37.5025060Z else: 2025-05-07T20:32:37.5025150Z scale_ub_tensor = None 2025-05-07T20:32:37.5025225Z 2025-05-07T20:32:37.5025354Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.5025444Z op = silu_mul_quant 2025-05-07T20:32:37.5025524Z if compiled: 2025-05-07T20:32:37.5025622Z op = torch.compile(op) 2025-05-07T20:32:37.5025730Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5025799Z 2025-05-07T20:32:37.5025886Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.5025893Z 2025-05-07T20:32:37.5025990Z moe/activation_test.py:117: 2025-05-07T20:32:37.5026116Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5026213Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.5026314Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5026821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:37.5026918Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.5027277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.5027498Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.5027888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.5027982Z kernel = self.compile( 2025-05-07T20:32:37.5028375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.5028547Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.5028667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5028674Z 2025-05-07T20:32:37.5028881Z self = 2025-05-07T20:32:37.5029666Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.5030180Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad769240>} 2025-05-07T20:32:37.5030973Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.5031167Z context = 2025-05-07T20:32:37.5031172Z 2025-05-07T20:32:37.5031338Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.5031641Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.5031750Z module_map=module_map) 2025-05-07T20:32:37.5031910Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.5032006Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.5032084Z E ^ 2025-05-07T20:32:37.5032442Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.5032447Z 2025-05-07T20:32:37.5032904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.5032913Z 2025-05-07T20:32:37.5033017Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5033237Z self=, 2025-05-07T20:32:37.5033315Z T=1, 2025-05-07T20:32:37.5033388Z D=7168, 2025-05-07T20:32:37.5033465Z scale_ub=None, 2025-05-07T20:32:37.5033553Z contiguous=True, 2025-05-07T20:32:37.5033633Z compiled=False, 2025-05-07T20:32:37.5033701Z ) 2025-05-07T20:32:37.5033921Z self = 2025-05-07T20:32:37.5034081Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:37.5034086Z 2025-05-07T20:32:37.5034163Z @given( 2025-05-07T20:32:37.5034281Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5034377Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5034495Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5034611Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5034722Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5034793Z ) 2025-05-07T20:32:37.5035037Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5035130Z def test_silu_mul_quant( 2025-05-07T20:32:37.5035205Z self, 2025-05-07T20:32:37.5035277Z T: int, 2025-05-07T20:32:37.5035349Z D: int, 2025-05-07T20:32:37.5035446Z scale_ub: Optional[float], 2025-05-07T20:32:37.5035532Z contiguous: bool, 2025-05-07T20:32:37.5035616Z compiled: bool, 2025-05-07T20:32:37.5035689Z ) -> None: 2025-05-07T20:32:37.5035781Z torch.manual_seed(2025) 2025-05-07T20:32:37.5035897Z 2025-05-07T20:32:37.5036064Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5036134Z 2025-05-07T20:32:37.5036228Z x_sign = torch.sign(x) 2025-05-07T20:32:37.5036349Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.5036435Z x = x_sign * x_clamp 2025-05-07T20:32:37.5036513Z x0 = x[:, :D] 2025-05-07T20:32:37.5036588Z x1 = x[:, D:] 2025-05-07T20:32:37.5036657Z 2025-05-07T20:32:37.5036740Z if contiguous: 2025-05-07T20:32:37.5036831Z x0 = x0.contiguous() 2025-05-07T20:32:37.5036919Z x1 = x1.contiguous() 2025-05-07T20:32:37.5036986Z 2025-05-07T20:32:37.5037073Z if scale_ub is not None: 2025-05-07T20:32:37.5037177Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.5037309Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.5037382Z ) 2025-05-07T20:32:37.5037456Z else: 2025-05-07T20:32:37.5037549Z scale_ub_tensor = None 2025-05-07T20:32:37.5037617Z 2025-05-07T20:32:37.5037747Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.5037836Z op = silu_mul_quant 2025-05-07T20:32:37.5037917Z if compiled: 2025-05-07T20:32:37.5038057Z op = torch.compile(op) 2025-05-07T20:32:37.5038161Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5038232Z 2025-05-07T20:32:37.5038319Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.5038361Z 2025-05-07T20:32:37.5038456Z moe/activation_test.py:117: 2025-05-07T20:32:37.5038607Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5038717Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.5038825Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5039332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.5039428Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.5039786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.5040053Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.5040398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.5040489Z kernel = self.compile( 2025-05-07T20:32:37.5040875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.5041048Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.5041170Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5041174Z 2025-05-07T20:32:37.5041378Z self = 2025-05-07T20:32:37.5045556Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.5046096Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad76a050>} 2025-05-07T20:32:37.5046856Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.5047058Z context = 2025-05-07T20:32:37.5047064Z 2025-05-07T20:32:37.5047231Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.5047561Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.5047668Z module_map=module_map) 2025-05-07T20:32:37.5047836Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.5047934Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.5048005Z E ^ 2025-05-07T20:32:37.5048393Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.5048398Z 2025-05-07T20:32:37.5048841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.5048847Z 2025-05-07T20:32:37.5048952Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5049173Z self=, 2025-05-07T20:32:37.5049244Z T=16384, 2025-05-07T20:32:37.5049321Z D=7168, 2025-05-07T20:32:37.5049400Z scale_ub=1200.0, 2025-05-07T20:32:37.5049483Z contiguous=False, 2025-05-07T20:32:37.5049569Z compiled=True, 2025-05-07T20:32:37.5049637Z ) 2025-05-07T20:32:37.5049853Z self = 2025-05-07T20:32:37.5050040Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:37.5050114Z 2025-05-07T20:32:37.5050187Z @given( 2025-05-07T20:32:37.5050306Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5050402Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5050557Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5050678Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5050789Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5050858Z ) 2025-05-07T20:32:37.5051109Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5051200Z def test_silu_mul_quant( 2025-05-07T20:32:37.5051271Z self, 2025-05-07T20:32:37.5051348Z T: int, 2025-05-07T20:32:37.5051419Z D: int, 2025-05-07T20:32:37.5051517Z scale_ub: Optional[float], 2025-05-07T20:32:37.5051607Z contiguous: bool, 2025-05-07T20:32:37.5051731Z compiled: bool, 2025-05-07T20:32:37.5051808Z ) -> None: 2025-05-07T20:32:37.5051904Z torch.manual_seed(2025) 2025-05-07T20:32:37.5051971Z 2025-05-07T20:32:37.5052143Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5052213Z 2025-05-07T20:32:37.5052304Z x_sign = torch.sign(x) 2025-05-07T20:32:37.5052431Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.5052517Z x = x_sign * x_clamp 2025-05-07T20:32:37.5052592Z x0 = x[:, :D] 2025-05-07T20:32:37.5052673Z x1 = x[:, D:] 2025-05-07T20:32:37.5052742Z 2025-05-07T20:32:37.5052822Z if contiguous: 2025-05-07T20:32:37.5052915Z x0 = x0.contiguous() 2025-05-07T20:32:37.5053001Z x1 = x1.contiguous() 2025-05-07T20:32:37.5053074Z 2025-05-07T20:32:37.5053164Z if scale_ub is not None: 2025-05-07T20:32:37.5053268Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.5053407Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.5053479Z ) 2025-05-07T20:32:37.5053558Z else: 2025-05-07T20:32:37.5053652Z scale_ub_tensor = None 2025-05-07T20:32:37.5053724Z 2025-05-07T20:32:37.5053853Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.5053946Z op = silu_mul_quant 2025-05-07T20:32:37.5054026Z if compiled: 2025-05-07T20:32:37.5054121Z op = torch.compile(op) 2025-05-07T20:32:37.5054226Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5054296Z 2025-05-07T20:32:37.5054390Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.5054395Z 2025-05-07T20:32:37.5054490Z moe/activation_test.py:117: 2025-05-07T20:32:37.5054661Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5054766Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.5054863Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5055238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.5055333Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.5056114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.5056223Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.5056585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.5056807Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.5057153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.5057249Z kernel = self.compile( 2025-05-07T20:32:37.5057635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.5057818Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.5058097Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5058104Z 2025-05-07T20:32:37.5058332Z self = 2025-05-07T20:32:37.5059298Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.5059809Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad76b490>} 2025-05-07T20:32:37.5060576Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.5060833Z context = 2025-05-07T20:32:37.5060838Z 2025-05-07T20:32:37.5061007Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.5061274Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.5061395Z module_map=module_map) 2025-05-07T20:32:37.5061558Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.5061654Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.5061733Z E ^ 2025-05-07T20:32:37.5062091Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.5062096Z 2025-05-07T20:32:37.5062517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.5062524Z 2025-05-07T20:32:37.5062629Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5062852Z self=, 2025-05-07T20:32:37.5062928Z T=1, 2025-05-07T20:32:37.5062999Z D=7168, 2025-05-07T20:32:37.5063079Z scale_ub=None, 2025-05-07T20:32:37.5063167Z contiguous=False, 2025-05-07T20:32:37.5063251Z compiled=False, 2025-05-07T20:32:37.5063318Z ) 2025-05-07T20:32:37.5063538Z self = 2025-05-07T20:32:37.5063708Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:37.5063713Z 2025-05-07T20:32:37.5063786Z @given( 2025-05-07T20:32:37.5063904Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5064064Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5064183Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5064299Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5064420Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5064501Z ) 2025-05-07T20:32:37.5064748Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5064840Z def test_silu_mul_quant( 2025-05-07T20:32:37.5064913Z self, 2025-05-07T20:32:37.5064993Z T: int, 2025-05-07T20:32:37.5065066Z D: int, 2025-05-07T20:32:37.5065161Z scale_ub: Optional[float], 2025-05-07T20:32:37.5065251Z contiguous: bool, 2025-05-07T20:32:37.5065333Z compiled: bool, 2025-05-07T20:32:37.5065410Z ) -> None: 2025-05-07T20:32:37.5065506Z torch.manual_seed(2025) 2025-05-07T20:32:37.5065573Z 2025-05-07T20:32:37.5065745Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5065817Z 2025-05-07T20:32:37.5065905Z x_sign = torch.sign(x) 2025-05-07T20:32:37.5066030Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.5066118Z x = x_sign * x_clamp 2025-05-07T20:32:37.5066194Z x0 = x[:, :D] 2025-05-07T20:32:37.5066318Z x1 = x[:, D:] 2025-05-07T20:32:37.5066387Z 2025-05-07T20:32:37.5066466Z if contiguous: 2025-05-07T20:32:37.5066555Z x0 = x0.contiguous() 2025-05-07T20:32:37.5066639Z x1 = x1.contiguous() 2025-05-07T20:32:37.5066751Z 2025-05-07T20:32:37.5066842Z if scale_ub is not None: 2025-05-07T20:32:37.5066946Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.5067082Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.5067154Z ) 2025-05-07T20:32:37.5067225Z else: 2025-05-07T20:32:37.5067323Z scale_ub_tensor = None 2025-05-07T20:32:37.5067390Z 2025-05-07T20:32:37.5067521Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.5067611Z op = silu_mul_quant 2025-05-07T20:32:37.5067690Z if compiled: 2025-05-07T20:32:37.5067833Z op = torch.compile(op) 2025-05-07T20:32:37.5067940Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5068011Z 2025-05-07T20:32:37.5068097Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.5068102Z 2025-05-07T20:32:37.5068203Z moe/activation_test.py:117: 2025-05-07T20:32:37.5068327Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5068436Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.5068552Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5069090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.5069187Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.5069550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.5069835Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.5070233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.5070325Z kernel = self.compile( 2025-05-07T20:32:37.5070712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.5070889Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.5071010Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5071014Z 2025-05-07T20:32:37.5071222Z self = 2025-05-07T20:32:37.5072060Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.5072576Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad76b7f0>} 2025-05-07T20:32:37.5073330Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.5073528Z context = 2025-05-07T20:32:37.5073533Z 2025-05-07T20:32:37.5073695Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.5073960Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.5074069Z module_map=module_map) 2025-05-07T20:32:37.5074230Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.5074327Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.5074405Z E ^ 2025-05-07T20:32:37.5074803Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.5074808Z 2025-05-07T20:32:37.5075227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.5075232Z 2025-05-07T20:32:37.5075374Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5075596Z self=, 2025-05-07T20:32:37.5075671Z T=2048, 2025-05-07T20:32:37.5075743Z D=7168, 2025-05-07T20:32:37.5075821Z scale_ub=None, 2025-05-07T20:32:37.5075908Z contiguous=False, 2025-05-07T20:32:37.5075988Z compiled=True, 2025-05-07T20:32:37.5076058Z ) 2025-05-07T20:32:37.5076277Z self = 2025-05-07T20:32:37.5076450Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:37.5076497Z 2025-05-07T20:32:37.5076574Z @given( 2025-05-07T20:32:37.5076690Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5076788Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5076903Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5077018Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5077131Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5077205Z ) 2025-05-07T20:32:37.5077448Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5077543Z def test_silu_mul_quant( 2025-05-07T20:32:37.5077614Z self, 2025-05-07T20:32:37.5077686Z T: int, 2025-05-07T20:32:37.5077761Z D: int, 2025-05-07T20:32:37.5077855Z scale_ub: Optional[float], 2025-05-07T20:32:37.5077944Z contiguous: bool, 2025-05-07T20:32:37.5078030Z compiled: bool, 2025-05-07T20:32:37.5078104Z ) -> None: 2025-05-07T20:32:37.5078201Z torch.manual_seed(2025) 2025-05-07T20:32:37.5078275Z 2025-05-07T20:32:37.5078445Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5078515Z 2025-05-07T20:32:37.5078608Z x_sign = torch.sign(x) 2025-05-07T20:32:37.5078729Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.5078820Z x = x_sign * x_clamp 2025-05-07T20:32:37.5078897Z x0 = x[:, :D] 2025-05-07T20:32:37.5078974Z x1 = x[:, D:] 2025-05-07T20:32:37.5079045Z 2025-05-07T20:32:37.5079123Z if contiguous: 2025-05-07T20:32:37.5079210Z x0 = x0.contiguous() 2025-05-07T20:32:37.5079296Z x1 = x1.contiguous() 2025-05-07T20:32:37.5079363Z 2025-05-07T20:32:37.5079451Z if scale_ub is not None: 2025-05-07T20:32:37.5079627Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.5079761Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.5079831Z ) 2025-05-07T20:32:37.5079911Z else: 2025-05-07T20:32:37.5080003Z scale_ub_tensor = None 2025-05-07T20:32:37.5080072Z 2025-05-07T20:32:37.5080204Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.5080291Z op = silu_mul_quant 2025-05-07T20:32:37.5080374Z if compiled: 2025-05-07T20:32:37.5080472Z op = torch.compile(op) 2025-05-07T20:32:37.5080575Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5080645Z 2025-05-07T20:32:37.5080732Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.5080736Z 2025-05-07T20:32:37.5080829Z moe/activation_test.py:117: 2025-05-07T20:32:37.5080957Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5081054Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.5081154Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5081527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.5081621Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.5082166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.5082262Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.5082621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.5082885Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.5083227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.5083321Z kernel = self.compile( 2025-05-07T20:32:37.5083708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.5083883Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.5084050Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5084054Z 2025-05-07T20:32:37.5084262Z self = 2025-05-07T20:32:37.5085048Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.5085560Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad24caf0>} 2025-05-07T20:32:37.5086318Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.5086514Z context = 2025-05-07T20:32:37.5086521Z 2025-05-07T20:32:37.5086687Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.5086954Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.5087058Z module_map=module_map) 2025-05-07T20:32:37.5087221Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.5087319Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.5087390Z E ^ 2025-05-07T20:32:37.5087747Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.5087755Z 2025-05-07T20:32:37.5088213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.5088218Z 2025-05-07T20:32:37.5088321Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5088545Z self=, 2025-05-07T20:32:37.5088619Z T=4096, 2025-05-07T20:32:37.5088691Z D=7168, 2025-05-07T20:32:37.5088775Z scale_ub=None, 2025-05-07T20:32:37.5088857Z contiguous=False, 2025-05-07T20:32:37.5088936Z compiled=True, 2025-05-07T20:32:37.5089007Z ) 2025-05-07T20:32:37.5089224Z self = 2025-05-07T20:32:37.5089400Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:37.5089405Z 2025-05-07T20:32:37.5089477Z @given( 2025-05-07T20:32:37.5089592Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5089691Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5089804Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5089922Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5090037Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5090105Z ) 2025-05-07T20:32:37.5090351Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5090489Z def test_silu_mul_quant( 2025-05-07T20:32:37.5090562Z self, 2025-05-07T20:32:37.5090637Z T: int, 2025-05-07T20:32:37.5090708Z D: int, 2025-05-07T20:32:37.5090803Z scale_ub: Optional[float], 2025-05-07T20:32:37.5090932Z contiguous: bool, 2025-05-07T20:32:37.5091015Z compiled: bool, 2025-05-07T20:32:37.5091088Z ) -> None: 2025-05-07T20:32:37.5091182Z torch.manual_seed(2025) 2025-05-07T20:32:37.5091249Z 2025-05-07T20:32:37.5091416Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5091487Z 2025-05-07T20:32:37.5091574Z x_sign = torch.sign(x) 2025-05-07T20:32:37.5091699Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.5091788Z x = x_sign * x_clamp 2025-05-07T20:32:37.5091864Z x0 = x[:, :D] 2025-05-07T20:32:37.5091986Z x1 = x[:, D:] 2025-05-07T20:32:37.5092054Z 2025-05-07T20:32:37.5092131Z if contiguous: 2025-05-07T20:32:37.5092224Z x0 = x0.contiguous() 2025-05-07T20:32:37.5092309Z x1 = x1.contiguous() 2025-05-07T20:32:37.5092375Z 2025-05-07T20:32:37.5092466Z if scale_ub is not None: 2025-05-07T20:32:37.5092568Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.5092704Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.5092777Z ) 2025-05-07T20:32:37.5092852Z else: 2025-05-07T20:32:37.5092943Z scale_ub_tensor = None 2025-05-07T20:32:37.5093017Z 2025-05-07T20:32:37.5093146Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.5093231Z op = silu_mul_quant 2025-05-07T20:32:37.5093317Z if compiled: 2025-05-07T20:32:37.5093412Z op = torch.compile(op) 2025-05-07T20:32:37.5093519Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5093589Z 2025-05-07T20:32:37.5093675Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.5093680Z 2025-05-07T20:32:37.5093779Z moe/activation_test.py:117: 2025-05-07T20:32:37.5093902Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5093999Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.5094101Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5094474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.5094565Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.5095065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:37.5095158Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:37.5095567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:37.5095793Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:37.5096137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:37.5096233Z     kernel = self.compile(
2025-05-07T20:32:37.5096618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:37.5096799Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:37.5096920Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:37.5096925Z 
2025-05-07T20:32:37.5097131Z self = 
2025-05-07T20:32:37.5097921Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:37.5098582Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad24c280>}
2025-05-07T20:32:37.5099363Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:37.5099593Z context = 
2025-05-07T20:32:37.5099598Z 
2025-05-07T20:32:37.5099766Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:37.5100031Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:37.5100142Z                            module_map=module_map)
2025-05-07T20:32:37.5100304Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:37.5100400Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:37.5100514Z E   ^
2025-05-07T20:32:37.5100881Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.5100886Z 
2025-05-07T20:32:37.5101303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:37.5101310Z 
2025-05-07T20:32:37.5101416Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.5101637Z     self=,
2025-05-07T20:32:37.5101709Z     T=16384,
2025-05-07T20:32:37.5101783Z     D=5120,
2025-05-07T20:32:37.5101863Z     scale_ub=1200.0,
2025-05-07T20:32:37.5101944Z     contiguous=False,
2025-05-07T20:32:37.5102027Z     compiled=False,
2025-05-07T20:32:37.5102093Z )
2025-05-07T20:32:37.5102313Z self = 
2025-05-07T20:32:37.5102496Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:37.5102504Z 
2025-05-07T20:32:37.5102578Z     @given(
2025-05-07T20:32:37.5102697Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:37.5102794Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:37.5102906Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:37.5103026Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:37.5103136Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:37.5103205Z     )
2025-05-07T20:32:37.5103454Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:37.5103544Z     def test_silu_mul_quant(
2025-05-07T20:32:37.5103615Z         self,
2025-05-07T20:32:37.5103688Z         T: int,
2025-05-07T20:32:37.5103758Z         D: int,
2025-05-07T20:32:37.5103901Z         scale_ub: Optional[float],
2025-05-07T20:32:37.5103988Z         contiguous: bool,
2025-05-07T20:32:37.5104068Z         compiled: bool,
2025-05-07T20:32:37.5104147Z     ) -> None:
2025-05-07T20:32:37.5104238Z         torch.manual_seed(2025)
2025-05-07T20:32:37.5104304Z 
2025-05-07T20:32:37.5104480Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:37.5104551Z 
2025-05-07T20:32:37.5104639Z         x_sign = torch.sign(x)
2025-05-07T20:32:37.5104766Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:37.5104852Z         x = x_sign * x_clamp
2025-05-07T20:32:37.5104926Z         x0 = x[:, :D]
2025-05-07T20:32:37.5105008Z         x1 = x[:, D:]
2025-05-07T20:32:37.5105074Z 
2025-05-07T20:32:37.5105158Z         if contiguous:
2025-05-07T20:32:37.5105245Z             x0 = x0.contiguous()
2025-05-07T20:32:37.5105329Z             x1 = x1.contiguous()
2025-05-07T20:32:37.5105400Z 
2025-05-07T20:32:37.5105491Z         if scale_ub is not None:
2025-05-07T20:32:37.5105592Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:37.5105725Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:37.5105799Z             )
2025-05-07T20:32:37.5105876Z         else:
2025-05-07T20:32:37.5106011Z             scale_ub_tensor = None
2025-05-07T20:32:37.5106081Z 
2025-05-07T20:32:37.5106211Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:37.5106296Z             op = silu_mul_quant
2025-05-07T20:32:37.5106417Z             if compiled:
2025-05-07T20:32:37.5106518Z                 op = torch.compile(op)
2025-05-07T20:32:37.5106621Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:37.5106689Z 
2025-05-07T20:32:37.5106781Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:37.5106785Z 
2025-05-07T20:32:37.5106880Z moe/activation_test.py:117: 
2025-05-07T20:32:37.5107008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:37.5107109Z moe/activation_test.py:115: in fn
2025-05-07T20:32:37.5107206Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:37.5107713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:37.5107876Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:37.5108236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:37.5108477Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:37.5108859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:37.5108954Z     kernel = self.compile(
2025-05-07T20:32:37.5109338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:37.5109513Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:37.5109636Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:37.5109641Z 
2025-05-07T20:32:37.5109847Z self = 
2025-05-07T20:32:37.5110636Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:37.5111146Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad24ed40>}
2025-05-07T20:32:37.5111900Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:37.5112137Z context = 
2025-05-07T20:32:37.5112142Z 
2025-05-07T20:32:37.5112307Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:37.5112578Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:37.5112684Z                            module_map=module_map)
2025-05-07T20:32:37.5112844Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:37.5112943Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:37.5113019Z E   ^
2025-05-07T20:32:37.5113376Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.5113385Z 
2025-05-07T20:32:37.5113800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
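Why every example fails: Triton's fp8e4nv is the NVIDIA e4m3 float8 format (the one torch.float8_e4m3fn maps to), and recent Triton releases only emit it for GPUs of compute capability 8.9 (Ada) or newer. The linux.g5.4xlarge runner class used by this job carries an NVIDIA A10G, which is SM 8.6, so the CUDA backend offers only the 'fp8e4b15' and 'fp8e5' (e5m2) formats named in the error. A minimal standalone sketch that reproduces the same CompilationError on such a GPU, assuming only that triton and torch are installed and a CUDA device is visible (this is not FBGEMM code):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
        offsets = tl.program_id(axis=0) * BLOCK + tl.arange(0, BLOCK)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        # This cast is what trips pre-SM89 backends: fp8e4nv is rejected
        # during ast_to_ttir, before any PTX is generated.
        tl.store(y_ptr + offsets, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(1,)](x, y, x.numel(), BLOCK=1024)  # compiles (and fails) at first launch

On SM 8.9+ hardware the same launch completes and y holds the fp8 payload; on the A10G it raises exactly the ValueError seen above.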
Hypothesis went on to try further examples, and every one failed with the identical CompilationError from the same _fbgemm_silu_mul_quant launch: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The repeated tracebacks, which differ only in object addresses and, for compiled=True, an extra torch/_dynamo/eval_frame.py:678 frame, are omitted; the distinct parameter sets were:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
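Note the two traceback shapes: with compiled=False the frame chain runs straight from activation_test.py into activation.py, while with compiled=True it first passes through torch/_dynamo/eval_frame.py:678. In both cases the error only appears at the call "y_fp8, y_scale = fn()" because compilation is deferred to the first invocation, for the Triton JIT kernel and for torch.compile alike. A small illustration of that laziness (generic PyTorch, not FBGEMM code):

    import torch

    def silu_mul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.silu(a) * b

    op = torch.compile(silu_mul)   # nothing is compiled yet; op merely wraps silu_mul
    a, b = torch.randn(8, 16), torch.randn(8, 16)
    out = op(a, b)                 # first call: Dynamo intercepts the frame
                                   # (torch/_dynamo/eval_frame.py) and compiles it

So a compile-time problem such as the fp8e4nv rejection surfaces as a runtime failure at the call site, which is exactly where Hypothesis reports it.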
The run continued in the same pattern, each example failing with the identical CompilationError:

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
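Since the failure is a property of the GPU rather than of any particular (T, D, scale_ub) combination, one way to keep such a job green on pre-Ada runners is to gate the FP8 tests on device capability. A hypothetical guard, shown below as a sketch: the helper name and the skipIf are illustrative, and activation_test.py as printed above contains no such check:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # fp8e4nv (e4m3) needs compute capability >= 8.9 (Ada/Hopper) in Triton.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class SiluMulQuantTest(unittest.TestCase):
        ...  # test_silu_mul_quant would live here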
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.5208788Z 2025-05-07T20:32:37.5209208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.5209216Z 2025-05-07T20:32:37.5209319Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5209543Z self=, 2025-05-07T20:32:37.5209620Z T=16384, 2025-05-07T20:32:37.5209693Z D=5120, 2025-05-07T20:32:37.5209775Z scale_ub=1200.0, 2025-05-07T20:32:37.5209860Z contiguous=True, 2025-05-07T20:32:37.5209941Z compiled=False, 2025-05-07T20:32:37.5210015Z ) 2025-05-07T20:32:37.5210235Z self = 2025-05-07T20:32:37.5210415Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:37.5210421Z 2025-05-07T20:32:37.5210495Z @given( 2025-05-07T20:32:37.5210612Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5210709Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5210827Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5210945Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5211061Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5211132Z ) 2025-05-07T20:32:37.5211379Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5211473Z def test_silu_mul_quant( 2025-05-07T20:32:37.5211547Z self, 2025-05-07T20:32:37.5211664Z T: int, 2025-05-07T20:32:37.5211741Z D: int, 2025-05-07T20:32:37.5211838Z scale_ub: Optional[float], 2025-05-07T20:32:37.5211928Z contiguous: bool, 2025-05-07T20:32:37.5212014Z compiled: bool, 2025-05-07T20:32:37.5212088Z ) -> None: 2025-05-07T20:32:37.5212183Z torch.manual_seed(2025) 2025-05-07T20:32:37.5212255Z 2025-05-07T20:32:37.5212425Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5212499Z 2025-05-07T20:32:37.5212592Z x_sign = torch.sign(x) 2025-05-07T20:32:37.5212717Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.5212807Z x = x_sign * x_clamp 2025-05-07T20:32:37.5212886Z x0 = x[:, :D] 2025-05-07T20:32:37.5212961Z x1 = x[:, D:] 2025-05-07T20:32:37.5213032Z 2025-05-07T20:32:37.5213113Z if contiguous: 2025-05-07T20:32:37.5213201Z x0 = x0.contiguous() 2025-05-07T20:32:37.5213294Z x1 = x1.contiguous() 2025-05-07T20:32:37.5213362Z 2025-05-07T20:32:37.5213451Z if scale_ub is not None: 2025-05-07T20:32:37.5213558Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.5213695Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.5213810Z ) 2025-05-07T20:32:37.5213888Z else: 2025-05-07T20:32:37.5213979Z scale_ub_tensor = None 2025-05-07T20:32:37.5214051Z 2025-05-07T20:32:37.5214181Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.5214308Z op = silu_mul_quant 2025-05-07T20:32:37.5214394Z if compiled: 2025-05-07T20:32:37.5214492Z op = torch.compile(op) 2025-05-07T20:32:37.5214596Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5214667Z 2025-05-07T20:32:37.5214756Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.5214761Z 2025-05-07T20:32:37.5214856Z moe/activation_test.py:117: 2025-05-07T20:32:37.5214990Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5215091Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.5215191Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5215740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:37.5215836Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.5216198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.5216424Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.5216767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.5216860Z kernel = self.compile( 2025-05-07T20:32:37.5217247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.5217425Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.5217545Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5217553Z 2025-05-07T20:32:37.5217763Z self = 2025-05-07T20:32:37.5218637Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.5219152Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7acdd7ac0>} 2025-05-07T20:32:37.5219955Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.5220149Z context = 2025-05-07T20:32:37.5220157Z 2025-05-07T20:32:37.5220325Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.5220595Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.5220701Z module_map=module_map) 2025-05-07T20:32:37.5220867Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.5220967Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.5221040Z E ^ 2025-05-07T20:32:37.5221402Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.5221406Z 2025-05-07T20:32:37.5221821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.5221825Z 2025-05-07T20:32:37.5221933Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5222156Z self=, 2025-05-07T20:32:37.5222231Z T=1, 2025-05-07T20:32:37.5222308Z D=7168, 2025-05-07T20:32:37.5222430Z scale_ub=1200.0, 2025-05-07T20:32:37.5222517Z contiguous=False, 2025-05-07T20:32:37.5222601Z compiled=False, 2025-05-07T20:32:37.5222670Z ) 2025-05-07T20:32:37.5222887Z self = 2025-05-07T20:32:37.5223097Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:37.5223102Z 2025-05-07T20:32:37.5223173Z @given( 2025-05-07T20:32:37.5223296Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5223394Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5223510Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5223634Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5223748Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5223817Z ) 2025-05-07T20:32:37.5224065Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5224196Z def test_silu_mul_quant( 2025-05-07T20:32:37.5224276Z self, 2025-05-07T20:32:37.5224350Z T: int, 2025-05-07T20:32:37.5224426Z D: int, 2025-05-07T20:32:37.5224525Z scale_ub: Optional[float], 2025-05-07T20:32:37.5224612Z contiguous: bool, 2025-05-07T20:32:37.5224697Z compiled: bool, 2025-05-07T20:32:37.5224773Z ) -> None: 2025-05-07T20:32:37.5224865Z torch.manual_seed(2025) 2025-05-07T20:32:37.5224934Z 2025-05-07T20:32:37.5225106Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5225176Z 2025-05-07T20:32:37.5225264Z x_sign = torch.sign(x) 2025-05-07T20:32:37.5225390Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.5225479Z x = x_sign * x_clamp 2025-05-07T20:32:37.5225555Z x0 = x[:, :D] 2025-05-07T20:32:37.5225638Z x1 = x[:, D:] 2025-05-07T20:32:37.5225709Z 2025-05-07T20:32:37.5225791Z if contiguous: 2025-05-07T20:32:37.5225879Z x0 = x0.contiguous() 2025-05-07T20:32:37.5225966Z x1 = x1.contiguous() 2025-05-07T20:32:37.5226039Z 2025-05-07T20:32:37.5226132Z if scale_ub is not None: 2025-05-07T20:32:37.5226236Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.5226376Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.5226449Z ) 2025-05-07T20:32:37.5226521Z else: 2025-05-07T20:32:37.5226619Z scale_ub_tensor = None 2025-05-07T20:32:37.5226687Z 2025-05-07T20:32:37.5226819Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.5226906Z op = silu_mul_quant 2025-05-07T20:32:37.5226987Z if compiled: 2025-05-07T20:32:37.5227137Z op = torch.compile(op) 2025-05-07T20:32:37.5227244Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5227313Z 2025-05-07T20:32:37.5227406Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.5227411Z 2025-05-07T20:32:37.5227509Z moe/activation_test.py:117: 2025-05-07T20:32:37.5227636Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5227735Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.5227833Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5228342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.5228436Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.5228796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.5229025Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.5229370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.5229468Z kernel = self.compile( 2025-05-07T20:32:37.5229896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.5230072Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.5230198Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5230241Z 2025-05-07T20:32:37.5230447Z self = 2025-05-07T20:32:37.5231238Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.5231751Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ac9744c0>} 2025-05-07T20:32:37.5232571Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.5232772Z context = 2025-05-07T20:32:37.5232776Z 2025-05-07T20:32:37.5232944Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.5233217Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.5233325Z module_map=module_map) 2025-05-07T20:32:37.5233486Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.5233588Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.5233662Z E ^ 2025-05-07T20:32:37.5234023Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.5234031Z 2025-05-07T20:32:37.5234449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.5234454Z 2025-05-07T20:32:37.5234558Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5234785Z self=, 2025-05-07T20:32:37.5234862Z T=4096, 2025-05-07T20:32:37.5234935Z D=7168, 2025-05-07T20:32:37.5235021Z scale_ub=1200.0, 2025-05-07T20:32:37.5235105Z contiguous=False, 2025-05-07T20:32:37.5235190Z compiled=True, 2025-05-07T20:32:37.5235260Z ) 2025-05-07T20:32:37.5235479Z self = 2025-05-07T20:32:37.5235659Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:37.5235664Z 2025-05-07T20:32:37.5235781Z @given( 2025-05-07T20:32:37.5235904Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5236006Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5236124Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5236243Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5236362Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5236433Z ) 2025-05-07T20:32:37.5236684Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5236780Z def test_silu_mul_quant( 2025-05-07T20:32:37.5236854Z self, 2025-05-07T20:32:37.5236931Z T: int, 2025-05-07T20:32:37.5237005Z D: int, 2025-05-07T20:32:37.5237104Z scale_ub: Optional[float], 2025-05-07T20:32:37.5237196Z contiguous: bool, 2025-05-07T20:32:37.5237282Z compiled: bool, 2025-05-07T20:32:37.5237358Z ) -> None: 2025-05-07T20:32:37.5237460Z torch.manual_seed(2025) 2025-05-07T20:32:37.5237532Z 2025-05-07T20:32:37.5237702Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5237777Z 2025-05-07T20:32:37.5237868Z x_sign = torch.sign(x) 2025-05-07T20:32:37.5238040Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.5238129Z x = x_sign * x_clamp 2025-05-07T20:32:37.5238208Z x0 = x[:, :D] 2025-05-07T20:32:37.5238290Z x1 = x[:, D:] 2025-05-07T20:32:37.5238401Z 2025-05-07T20:32:37.5238482Z if contiguous: 2025-05-07T20:32:37.5238576Z x0 = x0.contiguous() 2025-05-07T20:32:37.5238664Z x1 = x1.contiguous() 2025-05-07T20:32:37.5238734Z 2025-05-07T20:32:37.5238828Z if scale_ub is not None: 2025-05-07T20:32:37.5238932Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.5239065Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.5239140Z ) 2025-05-07T20:32:37.5239217Z else: 2025-05-07T20:32:37.5239310Z scale_ub_tensor = None 2025-05-07T20:32:37.5239385Z 2025-05-07T20:32:37.5239516Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.5239648Z op = silu_mul_quant 2025-05-07T20:32:37.5239734Z if compiled: 2025-05-07T20:32:37.5239832Z op = torch.compile(op) 2025-05-07T20:32:37.5239940Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5240010Z 2025-05-07T20:32:37.5240102Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.5240107Z 2025-05-07T20:32:37.5240208Z moe/activation_test.py:117: 2025-05-07T20:32:37.5240333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5240433Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.5240536Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5240912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.5241006Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.5241505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.5241610Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.5241972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.5242194Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.5242544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.5242639Z kernel = self.compile( 2025-05-07T20:32:37.5243024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.5243203Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.5243371Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5243377Z 2025-05-07T20:32:37.5243585Z self = 2025-05-07T20:32:37.5244379Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.5244888Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ac9751b0>} 2025-05-07T20:32:37.5245647Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.5245842Z context = 2025-05-07T20:32:37.5245847Z 2025-05-07T20:32:37.5246016Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.5246283Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.5246434Z module_map=module_map) 2025-05-07T20:32:37.5246600Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.5246699Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.5246774Z E ^ 2025-05-07T20:32:37.5247173Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:37.5247705Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  [CompilationError: fp8e4nv unsupported]
2025-05-07T20:32:37.5261431Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)  [CompilationError: fp8e4nv unsupported]
[Both examples fail with the identical error and traceback shown above; the repeated source listing and traceback are omitted.]
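For orientation, this is a rough eager-mode reference of what the op appears to compute, inferred only from the test's call shape (x0, x1, optional scale_ub tensor in; (y_fp8, y_scale) out) and the kernel name; the rowwise scaling scheme below is an assumption, not FBGEMM's documented algorithm:

    import torch
    import torch.nn.functional as F

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # fp8e4nv, in Triton's terms

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        # SiLU(x0) * x1, then rowwise quantization to FP8 (assumed scheme).
        y = F.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1, keepdim=True)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = (row_max / FP8_MAX).clamp_min(1e-12)
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale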
2025-05-07T20:32:37.5274578Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
[Repeated source listing omitted; this example fails while building its inputs, before reaching the kernel.]
2025-05-07T20:32:37.5278002Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:37.5279894Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:37.5280059Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:37.5280172Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)  [OutOfMemoryError: 112.00 MiB at moe/activation_test.py:95]
2025-05-07T20:32:37.5285656Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)  [OutOfMemoryError: 448.00 MiB at moe/activation_test.py:92]
2025-05-07T20:32:37.5294112Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)  [OutOfMemoryError: 56.00 MiB at moe/activation_test.py:95]
2025-05-07T20:32:37.5299709Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)  [OutOfMemoryError: 56.00 MiB at moe/activation_test.py:94]
[Repeated listings and allocator messages omitted; each message carries the same PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True suggestion shown above.]
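The allocation sizes in these OutOfMemoryError messages match the test's own tensors exactly: x has shape [T, 2*D] in bfloat16 (2 bytes per element), and torch.sign, torch.abs, and torch.clamp each materialize another tensor of that size. A quick check (standalone snippet, not part of the test file):

    def tensor_mib(T: int, D: int) -> float:
        # Size of one [T, 2*D] bfloat16 tensor (2 bytes per element), in MiB.
        return T * 2 * D * 2 / 2**20

    assert tensor_mib(16384, 5120) == 320.0  # "Tried to allocate 320.00 MiB" at x_clamp
    assert tensor_mib(4096, 7168) == 112.0   # 112.00 MiB at x_clamp
    assert tensor_mib(16384, 7168) == 448.0  # 448.00 MiB at torch.randn
    assert tensor_mib(2048, 7168) == 56.0    # 56.00 MiB at x_sign / x_clamp

Each failing example therefore needs only tens to hundreds of MiB; the card runs out because the messages report roughly 21.6 GiB already allocated by PyTorch, which points at memory accumulating across Hypothesis draws rather than any single draw being too large.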
2025-05-07T20:32:37.5305033Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)  [CompilationError: fp8e4nv unsupported]
2025-05-07T20:32:37.5317631Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)  [CompilationError: fp8e4nv unsupported]
2025-05-07T20:32:37.5330335Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)  [CompilationError: fp8e4nv unsupported]
[Identical listings and tracebacks omitted. Note these examples run with compiled=False, so the error comes from the direct Triton kernel launch in silu_mul_quant, not from torch.compile.]
2025-05-07T20:32:37.5342954Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)  [OutOfMemoryError: 56.00 MiB at moe/activation_test.py:92]
2025-05-07T20:32:37.5348202Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)  [CompilationError: fp8e4nv unsupported]
2025-05-07T20:32:37.5361284Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)  [OutOfMemoryError: 40.00 MiB at moe/activation_test.py:94]
2025-05-07T20:32:37.5366756Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)  [OutOfMemoryError: 320.00 MiB at moe/activation_test.py:92]
2025-05-07T20:32:37.5371962Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)  [OutOfMemoryError: 80.00 MiB at moe/activation_test.py:92]
2025-05-07T20:32:37.5377175Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)  [OutOfMemoryError: 40.00 MiB at moe/activation_test.py:92]
2025-05-07T20:32:37.5382529Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)  [OutOfMemoryError: 112.00 MiB at moe/activation_test.py:92]
2025-05-07T20:32:37.5387697Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)  [OutOfMemoryError: 40.00 MiB at the torch.randn line; the excerpt ends inside this block]
[Repeated listings, tracebacks, and allocator messages omitted for these eight examples.]
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.5392752Z 2025-05-07T20:32:37.5392872Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.5392877Z 2025-05-07T20:32:37.5392978Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5393205Z self=, 2025-05-07T20:32:37.5393282Z T=4096, 2025-05-07T20:32:37.5393355Z D=7168, 2025-05-07T20:32:37.5393435Z scale_ub=1200.0, 2025-05-07T20:32:37.5393522Z contiguous=True, 2025-05-07T20:32:37.5393605Z compiled=False, 2025-05-07T20:32:37.5393676Z ) 2025-05-07T20:32:37.5393893Z self = 2025-05-07T20:32:37.5394111Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:37.5394116Z 2025-05-07T20:32:37.5394195Z @given( 2025-05-07T20:32:37.5394315Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5394413Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5394532Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5394648Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5394761Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5394837Z ) 2025-05-07T20:32:37.5395081Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5395175Z def test_silu_mul_quant( 2025-05-07T20:32:37.5395252Z self, 2025-05-07T20:32:37.5395326Z T: int, 2025-05-07T20:32:37.5395403Z D: int, 2025-05-07T20:32:37.5395499Z scale_ub: Optional[float], 2025-05-07T20:32:37.5395587Z contiguous: bool, 2025-05-07T20:32:37.5395675Z compiled: bool, 2025-05-07T20:32:37.5395751Z ) -> None: 2025-05-07T20:32:37.5395844Z torch.manual_seed(2025) 2025-05-07T20:32:37.5395918Z 2025-05-07T20:32:37.5396088Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5397952Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.5397993Z 2025-05-07T20:32:37.5398109Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.5398114Z 2025-05-07T20:32:37.5398217Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5398444Z self=, 2025-05-07T20:32:37.5398558Z T=16384, 2025-05-07T20:32:37.5398637Z D=7168, 2025-05-07T20:32:37.5398719Z scale_ub=None, 2025-05-07T20:32:37.5398805Z contiguous=False, 2025-05-07T20:32:37.5398889Z compiled=True, 2025-05-07T20:32:37.5398959Z ) 2025-05-07T20:32:37.5399177Z self = 2025-05-07T20:32:37.5399358Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:37.5399362Z 2025-05-07T20:32:37.5399438Z @given( 2025-05-07T20:32:37.5399553Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5399655Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5399767Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5399887Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5400005Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5400077Z ) 2025-05-07T20:32:37.5400324Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5400420Z def test_silu_mul_quant( 2025-05-07T20:32:37.5400496Z self, 2025-05-07T20:32:37.5400578Z T: int, 2025-05-07T20:32:37.5400649Z D: int, 2025-05-07T20:32:37.5400746Z scale_ub: Optional[float], 2025-05-07T20:32:37.5400836Z contiguous: bool, 2025-05-07T20:32:37.5400920Z compiled: bool, 2025-05-07T20:32:37.5400995Z ) -> None: 2025-05-07T20:32:37.5401090Z torch.manual_seed(2025) 2025-05-07T20:32:37.5401159Z 2025-05-07T20:32:37.5401331Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5403196Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.5403205Z 2025-05-07T20:32:37.5403325Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.5403333Z 2025-05-07T20:32:37.5403435Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5403657Z self=, 2025-05-07T20:32:37.5403735Z T=4096, 2025-05-07T20:32:37.5403807Z D=7168, 2025-05-07T20:32:37.5403885Z scale_ub=None, 2025-05-07T20:32:37.5403969Z contiguous=True, 2025-05-07T20:32:37.5404050Z compiled=False, 2025-05-07T20:32:37.5404118Z ) 2025-05-07T20:32:37.5404337Z self = 2025-05-07T20:32:37.5404504Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:37.5404511Z 2025-05-07T20:32:37.5404587Z @given( 2025-05-07T20:32:37.5404742Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5404842Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5404957Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5405072Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5405224Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5405296Z ) 2025-05-07T20:32:37.5405539Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5405630Z def test_silu_mul_quant( 2025-05-07T20:32:37.5405705Z self, 2025-05-07T20:32:37.5405777Z T: int, 2025-05-07T20:32:37.5405853Z D: int, 2025-05-07T20:32:37.5405948Z scale_ub: Optional[float], 2025-05-07T20:32:37.5406038Z contiguous: bool, 2025-05-07T20:32:37.5406123Z compiled: bool, 2025-05-07T20:32:37.5406198Z ) -> None: 2025-05-07T20:32:37.5406330Z torch.manual_seed(2025) 2025-05-07T20:32:37.5406402Z 2025-05-07T20:32:37.5406572Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5408391Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.5408400Z 2025-05-07T20:32:37.5408517Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.5408522Z 2025-05-07T20:32:37.5408622Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5408847Z self=, 2025-05-07T20:32:37.5408926Z T=16384, 2025-05-07T20:32:37.5409003Z D=7168, 2025-05-07T20:32:37.5409081Z scale_ub=None, 2025-05-07T20:32:37.5409163Z contiguous=True, 2025-05-07T20:32:37.5409246Z compiled=False, 2025-05-07T20:32:37.5409315Z ) 2025-05-07T20:32:37.5409532Z self = 2025-05-07T20:32:37.5409706Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:37.5409711Z 2025-05-07T20:32:37.5409784Z @given( 2025-05-07T20:32:37.5409899Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5409998Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5410115Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5410275Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5410388Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5410461Z ) 2025-05-07T20:32:37.5410707Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5410801Z def test_silu_mul_quant( 2025-05-07T20:32:37.5410874Z self, 2025-05-07T20:32:37.5410948Z T: int, 2025-05-07T20:32:37.5411021Z D: int, 2025-05-07T20:32:37.5411116Z scale_ub: Optional[float], 2025-05-07T20:32:37.5411208Z contiguous: bool, 2025-05-07T20:32:37.5411289Z compiled: bool, 2025-05-07T20:32:37.5411363Z ) -> None: 2025-05-07T20:32:37.5411458Z torch.manual_seed(2025) 2025-05-07T20:32:37.5411527Z 2025-05-07T20:32:37.5411696Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5413555Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.5413564Z 2025-05-07T20:32:37.5413722Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.5413726Z 2025-05-07T20:32:37.5413827Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5414051Z self=, 2025-05-07T20:32:37.5414128Z T=16384, 2025-05-07T20:32:37.5414200Z D=7168, 2025-05-07T20:32:37.5414279Z scale_ub=1200.0, 2025-05-07T20:32:37.5414362Z contiguous=True, 2025-05-07T20:32:37.5414443Z compiled=False, 2025-05-07T20:32:37.5414519Z ) 2025-05-07T20:32:37.5418315Z self = 2025-05-07T20:32:37.5418502Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:37.5418581Z 2025-05-07T20:32:37.5418664Z @given( 2025-05-07T20:32:37.5418791Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5418888Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5419003Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5419123Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5419237Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5419307Z ) 2025-05-07T20:32:37.5419554Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5419649Z def test_silu_mul_quant( 2025-05-07T20:32:37.5419723Z self, 2025-05-07T20:32:37.5419797Z T: int, 2025-05-07T20:32:37.5419876Z D: int, 2025-05-07T20:32:37.5419971Z scale_ub: Optional[float], 2025-05-07T20:32:37.5420059Z contiguous: bool, 2025-05-07T20:32:37.5420148Z compiled: bool, 2025-05-07T20:32:37.5420225Z ) -> None: 2025-05-07T20:32:37.5420318Z torch.manual_seed(2025) 2025-05-07T20:32:37.5420392Z 2025-05-07T20:32:37.5420561Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5422441Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
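Two things stand out in these failures. First, each failed allocation matches the requested tensor exactly: a [T, 2*D] bf16 tensor needs T x 2D x 2 bytes, e.g. 16384 x 14336 x 2 B = 448.00 MiB and 4096 x 10240 x 2 B = 80.00 MiB. So no single request is oversized; the 22.07 GiB card is simply already holding 22.04 GiB from earlier examples. Second, the error text itself names the standard mitigation. A minimal mitigation sketch along those lines; the helper below is hypothetical and not part of the test suite:

    import gc
    import os

    # Assumption: must be set before the first CUDA allocation in the process;
    # this is the setting the OOM message itself recommends when reserved-but-
    # unallocated memory is large.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cuda_memory() -> None:
        # Hypothetical per-example cleanup: drop dead Python references first,
        # then return cached allocator blocks to the driver so the next
        # Hypothesis example starts from a mostly empty device.
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

Calling a helper like this between examples would not shrink the test's own tensors, but it would stop allocations from accumulating across the dozens of examples Hypothesis generates.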
Trying example: test_silu_mul_quant(
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

This example allocates successfully, runs the full test body, and fails inside the kernel launch instead:

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = triton.compiler.compiler.ASTSource, options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
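The ValueError pins the root cause for this leg of the matrix: Triton only lowers the fp8e4nv (e4m3) type on GPUs of compute capability 8.9 or newer (Ada/Hopper), while the linux.g5.4xlarge runner carries an NVIDIA A10G at capability (8, 6), where only fp8e4b15 and fp8e5 are available. A hedged guard sketch that one could put in front of fp8 tests; the helper name and the (8, 9) cutoff are assumptions drawn from the error text, not FBGEMM API:

    import pytest
    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv path needs SM 8.9+ (Ada/Hopper);
        # the A10G on this runner reports (8, 6), hence the CompilationError.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Usage sketch: skip fp8 kernels cleanly instead of failing in Triton codegen.
    requires_fp8 = pytest.mark.skipif(
        not cuda_supports_fp8e4nv(), reason="fp8e4nv requires SM 8.9+"
    )

With such a marker the job would report a skip on g5 runners rather than a hard failure deep inside ast_to_ttir.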
Trying example: test_silu_mul_quant(
    T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False,
)
> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True,
)
The compiled variant reaches the same kernel through torch/_dynamo/eval_frame.py:678 (in _fn: return fn(*args, **kwargs)) and fails with the identical CompilationError at _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

The remaining examples of this session hit CUDA OOM again, now with only 4.44 MiB free on the device:

T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False -> 20.00 MiB at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): OutOfMemoryError
T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True  -> 20.00 MiB at moe/activation_test.py:95: OutOfMemoryError
T=128, D=7168, scale_ub=None,   contiguous=True, compiled=True  -> 20.00 MiB at moe/activation_test.py:92 (x = torch.randn(...)): OutOfMemoryError

=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
See " 2025-05-07T20:32:37.5472455Z 2025-05-07T20:32:37.5472672Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:37.5472838Z ================= 1 failed, 1 deselected, 3 warnings in 17.47s ================= 2025-05-07T20:32:39.0833470Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:39.1454746Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:32:39.1455335Z 2025-05-07T20:32:41.1472244Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:43.2931412Z ============================= test session starts ============================== 2025-05-07T20:32:43.2932343Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:43.2932880Z cachedir: .pytest_cache 2025-05-07T20:32:43.2933472Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:43.2934332Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:43.2934751Z plugins: hypothesis-6.131.14 2025-05-07T20:32:44.8971797Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:45.0757223Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:45.0758266Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:45.0758544Z 2025-05-07T20:32:47.5954608Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.5956113Z self=, 2025-05-07T20:32:47.5956682Z T=1, 2025-05-07T20:32:47.5956933Z D=5120, 2025-05-07T20:32:47.5957189Z scale_ub=None, 2025-05-07T20:32:47.5957441Z contiguous=True, 2025-05-07T20:32:47.5957684Z compiled=True, 2025-05-07T20:32:47.5957903Z ) 2025-05-07T20:32:47.5958230Z self = 2025-05-07T20:32:47.5958733Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:47.5959008Z 2025-05-07T20:32:47.5959089Z @given( 2025-05-07T20:32:47.5959337Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.5959653Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.5959978Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.5960322Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.5960657Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.5960952Z ) 2025-05-07T20:32:47.5961318Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.5961769Z def test_silu_mul_quant( 2025-05-07T20:32:47.5962013Z self, 2025-05-07T20:32:47.5962216Z T: int, 2025-05-07T20:32:47.5962421Z D: int, 2025-05-07T20:32:47.5962642Z scale_ub: Optional[float], 2025-05-07T20:32:47.5962927Z contiguous: bool, 2025-05-07T20:32:47.5963176Z compiled: bool, 2025-05-07T20:32:47.5963402Z ) -> None: 2025-05-07T20:32:47.5963625Z torch.manual_seed(2025) 2025-05-07T20:32:47.5963876Z 2025-05-07T20:32:47.5964157Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.5964508Z 2025-05-07T20:32:47.5964840Z x_sign = torch.sign(x) 2025-05-07T20:32:47.5965140Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:47.5954608Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(
    T=1, D=5120, scale_ub=None, contiguous=True, compiled=True,
)
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

On the fresh attempt this example allocates and, per pytest's failing-line marker, fn() itself returns; the failure moves into the test's reference path:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
(the autotuner benchmarks each config via triton/runtime/autotuner.py:186 -> _bench -> testing.py:117 do_bench, then jit.py:623: kernel = self.compile(...), compiler.py:273: module = src.make_ir(...))
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
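Notably, the failure has now moved out of the kernel under test and into the test's own baseline: triton_quantize_fp8_row is itself a Triton kernel, so on this GPU even the reference cannot compile. Assuming torch.float8_e4m3fn is available (PyTorch 2.1+), a rowwise fp8 reference can be written in plain PyTorch; this is an illustrative sketch under that assumption, not FBGEMM's kernel semantics:

    from typing import Optional, Tuple

    import torch

    # Largest finite value representable in float8_e4m3fn (448.0).
    E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Rowwise quantization sketch: y ~= y_fp8.float() * scale[:, None],
        # matching how the test reconstructs y from (y_fp8, y_scale).
        row_max = y.abs().amax(dim=-1, keepdim=True).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap the per-row scale
        scale = row_max.clamp(min=1e-12) / E4M3_MAX
        y_fp8 = (y.float() / scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)

A Triton-free fallback like this would let the comparison logic run on pre-Ada GPUs, at the cost of not exercising the production quantization kernel.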
2025-05-07T20:32:48.9368354Z x_sign = torch.sign(x) 2025-05-07T20:32:48.9368655Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.9368980Z x = x_sign * x_clamp 2025-05-07T20:32:48.9369230Z x0 = x[:, :D] 2025-05-07T20:32:48.9369446Z x1 = x[:, D:] 2025-05-07T20:32:48.9369661Z 2025-05-07T20:32:48.9369854Z if contiguous: 2025-05-07T20:32:48.9370093Z x0 = x0.contiguous() 2025-05-07T20:32:48.9370361Z x1 = x1.contiguous() 2025-05-07T20:32:48.9370611Z 2025-05-07T20:32:48.9370803Z if scale_ub is not None: 2025-05-07T20:32:48.9371090Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.9371440Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.9371756Z ) 2025-05-07T20:32:48.9371963Z else: 2025-05-07T20:32:48.9372220Z scale_ub_tensor = None 2025-05-07T20:32:48.9372482Z 2025-05-07T20:32:48.9372718Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.9373044Z op = silu_mul_quant 2025-05-07T20:32:48.9373301Z if compiled: 2025-05-07T20:32:48.9373641Z op = torch.compile(op) 2025-05-07T20:32:48.9373950Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.9374230Z 2025-05-07T20:32:48.9374428Z > y_fp8, y_scale = fn() 2025-05-07T20:32:48.9374602Z 2025-05-07T20:32:48.9374786Z moe/activation_test.py:117: 2025-05-07T20:32:48.9375089Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.9375424Z moe/activation_test.py:115: in fn 2025-05-07T20:32:48.9375708Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.9376423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:48.9377141Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:48.9377688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.9378568Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.9379274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.9386672Z kernel = self.compile( 2025-05-07T20:32:48.9387261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.9387940Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.9388351Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.9388581Z 2025-05-07T20:32:48.9388802Z self = 2025-05-07T20:32:48.9389909Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.9391387Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd6e1219990>} 2025-05-07T20:32:48.9392798Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.9393844Z context = 2025-05-07T20:32:48.9394134Z 2025-05-07T20:32:48.9394313Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.9394844Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.9395391Z module_map=module_map) 2025-05-07T20:32:48.9395774Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.9396142Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:48.9396404Z E ^ 2025-05-07T20:32:48.9396890Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.9397356Z 2025-05-07T20:32:48.9397780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.9398304Z 2025-05-07T20:32:48.9398418Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.9398837Z self=, 2025-05-07T20:32:48.9399245Z T=2048, 2025-05-07T20:32:48.9399440Z D=5120, 2025-05-07T20:32:48.9399633Z scale_ub=1200.0, 2025-05-07T20:32:48.9399866Z contiguous=True, 2025-05-07T20:32:48.9400094Z compiled=True, 2025-05-07T20:32:48.9400303Z ) 2025-05-07T20:32:48.9400631Z self = 2025-05-07T20:32:48.9401134Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:48.9401408Z 2025-05-07T20:32:48.9401546Z @given( 2025-05-07T20:32:48.9401779Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.9402125Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.9402467Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.9402843Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.9403183Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.9403476Z ) 2025-05-07T20:32:48.9403829Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.9404278Z def test_silu_mul_quant( 2025-05-07T20:32:48.9404527Z self, 2025-05-07T20:32:48.9404720Z T: int, 2025-05-07T20:32:48.9404927Z D: int, 2025-05-07T20:32:48.9405153Z scale_ub: Optional[float], 2025-05-07T20:32:48.9405432Z contiguous: bool, 2025-05-07T20:32:48.9405671Z compiled: bool, 2025-05-07T20:32:48.9405949Z ) -> None: 2025-05-07T20:32:48.9406170Z torch.manual_seed(2025) 2025-05-07T20:32:48.9406413Z 2025-05-07T20:32:48.9406696Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.9407044Z 2025-05-07T20:32:48.9407238Z x_sign = torch.sign(x) 2025-05-07T20:32:48.9407543Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.9407861Z x = x_sign * x_clamp 2025-05-07T20:32:48.9408099Z x0 = x[:, :D] 2025-05-07T20:32:48.9408320Z x1 = x[:, D:] 2025-05-07T20:32:48.9408532Z 2025-05-07T20:32:48.9408719Z if contiguous: 2025-05-07T20:32:48.9408960Z x0 = x0.contiguous() 2025-05-07T20:32:48.9409223Z x1 = x1.contiguous() 2025-05-07T20:32:48.9409460Z 2025-05-07T20:32:48.9409660Z if scale_ub is not None: 2025-05-07T20:32:48.9409931Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.9410272Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.9410585Z ) 2025-05-07T20:32:48.9410781Z else: 2025-05-07T20:32:48.9410994Z scale_ub_tensor = None 2025-05-07T20:32:48.9411250Z 2025-05-07T20:32:48.9411491Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.9411802Z op = silu_mul_quant 2025-05-07T20:32:48.9412056Z if compiled: 
2025-05-07T20:32:48.9412307Z                 op = torch.compile(op)
2025-05-07T20:32:48.9412602Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:48.9412881Z 
2025-05-07T20:32:48.9413079Z         y_fp8, y_scale = fn()
2025-05-07T20:32:48.9413365Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:48.9413654Z 
2025-05-07T20:32:48.9413952Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:48.9414288Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:48.9414586Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:48.9414910Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:48.9415277Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:48.9415584Z 
2025-05-07T20:32:48.9415794Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:48.9415994Z 
2025-05-07T20:32:48.9416102Z moe/activation_test.py:126: 
2025-05-07T20:32:48.9416398Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:48.9416735Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:48.9417071Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:48.9417867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:48.9418710Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:48.9419272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:48.9419970Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:48.9420714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:48.9421452Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:48.9422311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:48.9423069Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:48.9423803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:48.9424454Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:48.9425063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:48.9425627Z     fn()
2025-05-07T20:32:48.9426142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:48.9426727Z     self.fn.run(
2025-05-07T20:32:48.9427207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:48.9427744Z     kernel = self.compile(
2025-05-07T20:32:48.9428299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:48.9428967Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:48.9429366Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:48.9429592Z 
2025-05-07T20:32:48.9429805Z self = <...>
2025-05-07T20:32:48.9430909Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:48.9432360Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fd6dbc2d3f0>}
2025-05-07T20:32:48.9433729Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:48.9434767Z context = <...>
2025-05-07T20:32:48.9435066Z 
2025-05-07T20:32:48.9435238Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:48.9435816Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:48.9436299Z                            module_map=module_map)
2025-05-07T20:32:48.9436668Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:48.9437038Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:48.9437307Z E       ^
2025-05-07T20:32:48.9437781Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:48.9438247Z 
2025-05-07T20:32:48.9438672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:48.9439196Z 
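Note on the failure: this is a hardware capability gap, not a logic bug in the test. Triton's "fp8e4nv" is the NVIDIA e4m3 FP8 format, which Triton only compiles for compute capability 8.9 and newer (Ada/Hopper); the A10G GPUs behind linux.g5.4xlarge are SM 8.6, which only exposes 'fp8e4b15' and 'fp8e5', exactly as the ValueError reports. A minimal sketch of a guard that would skip these cases on unsupported GPUs follows; supports_fp8e4nv is a hypothetical helper, not an existing FBGEMM API.

    # Sketch: gate fp8 e4m3 tests on GPU capability. Assumption: e4m3/fp8e4nv
    # requires NVIDIA compute capability >= 8.9; the helper name is hypothetical.
    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)  # SM 8.9+ (e.g. L4, L40S, H100)

    # Applied to the failing test, e.g.:
    # @pytest.mark.skipif(not supports_fp8e4nv(), reason="fp8e4nv needs SM 8.9+")
    # def test_silu_mul_quant(...): ...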
torch.Tensor]: 2025-05-07T20:32:50.1473254Z op = silu_mul_quant 2025-05-07T20:32:50.1473513Z if compiled: 2025-05-07T20:32:50.1473768Z op = torch.compile(op) 2025-05-07T20:32:50.1474079Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.1474360Z 2025-05-07T20:32:50.1474560Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.1474734Z 2025-05-07T20:32:50.1474932Z moe/activation_test.py:117: 2025-05-07T20:32:50.1475239Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.1475579Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.1475872Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.1476590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.1477301Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.1477852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.1478550Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.1479229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.1479775Z kernel = self.compile( 2025-05-07T20:32:50.1480333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.1481029Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.1481438Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.1481675Z 2025-05-07T20:32:50.1481935Z self = 2025-05-07T20:32:50.1483052Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.1484521Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6dbc2ce50>} 2025-05-07T20:32:50.1485892Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.1486945Z context = 2025-05-07T20:32:50.1487285Z 2025-05-07T20:32:50.1487460Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.1487999Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.1488480Z module_map=module_map) 2025-05-07T20:32:50.1488861Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.1489233Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.1489497Z E ^ 2025-05-07T20:32:50.1489983Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.1490450Z 2025-05-07T20:32:50.1490885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.1491405Z 2025-05-07T20:32:50.1491519Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.1491939Z self=, 2025-05-07T20:32:50.1492354Z T=1, 2025-05-07T20:32:50.1492552Z D=7168, 2025-05-07T20:32:50.1492749Z scale_ub=None, 2025-05-07T20:32:50.1492978Z contiguous=True, 2025-05-07T20:32:50.1493218Z compiled=True, 2025-05-07T20:32:50.1493430Z ) 2025-05-07T20:32:50.1493760Z self = 2025-05-07T20:32:50.1494255Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:50.1494518Z 2025-05-07T20:32:50.1494604Z @given( 2025-05-07T20:32:50.1494840Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.1495164Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.1495479Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.1495856Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.1496197Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.1496491Z ) 2025-05-07T20:32:50.1496845Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.1497298Z def test_silu_mul_quant( 2025-05-07T20:32:50.1497551Z self, 2025-05-07T20:32:50.1497753Z T: int, 2025-05-07T20:32:50.1497950Z D: int, 2025-05-07T20:32:50.1498290Z scale_ub: Optional[float], 2025-05-07T20:32:50.1498571Z contiguous: bool, 2025-05-07T20:32:50.1498812Z compiled: bool, 2025-05-07T20:32:50.1499043Z ) -> None: 2025-05-07T20:32:50.1499269Z torch.manual_seed(2025) 2025-05-07T20:32:50.1499513Z 2025-05-07T20:32:50.1499796Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.1500147Z 2025-05-07T20:32:50.1500343Z x_sign = torch.sign(x) 2025-05-07T20:32:50.1500648Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.1500966Z x = x_sign * x_clamp 2025-05-07T20:32:50.1501209Z x0 = x[:, :D] 2025-05-07T20:32:50.1501439Z x1 = x[:, D:] 2025-05-07T20:32:50.1501680Z 2025-05-07T20:32:50.1501895Z if contiguous: 2025-05-07T20:32:50.1502181Z x0 = x0.contiguous() 2025-05-07T20:32:50.1502450Z x1 = x1.contiguous() 2025-05-07T20:32:50.1502691Z 2025-05-07T20:32:50.1502894Z if scale_ub is not None: 2025-05-07T20:32:50.1503223Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.1503570Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.1503880Z ) 2025-05-07T20:32:50.1504081Z else: 2025-05-07T20:32:50.1504300Z scale_ub_tensor = None 2025-05-07T20:32:50.1504554Z 2025-05-07T20:32:50.1504794Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.1505117Z op = silu_mul_quant 2025-05-07T20:32:50.1505373Z if compiled: 2025-05-07T20:32:50.1505635Z op = torch.compile(op) 2025-05-07T20:32:50.1505941Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.1506259Z 2025-05-07T20:32:50.1506459Z y_fp8, y_scale = fn() 2025-05-07T20:32:50.1506758Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:50.1507046Z 2025-05-07T20:32:50.1507294Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.1507640Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:50.1507944Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:50.1508262Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:50.1508630Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:50.1508945Z 2025-05-07T20:32:50.1509149Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:50.1509368Z 2025-05-07T20:32:50.1509471Z moe/activation_test.py:126: 2025-05-07T20:32:50.1509777Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.1510113Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:50.1510449Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:50.1511252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:50.1512073Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:50.1512630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.1513320Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.1514023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:50.1514760Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:50.1515569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:50.1516334Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:50.1517081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:50.1517735Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:50.1518344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:50.1518876Z fn() 2025-05-07T20:32:50.1519401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:50.1519992Z self.fn.run( 2025-05-07T20:32:50.1520467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.1521015Z kernel = self.compile( 2025-05-07T20:32:50.1521577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.1522243Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.1522692Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.1522922Z 2025-05-07T20:32:50.1523142Z self = 2025-05-07T20:32:50.1524285Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.1525675Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd6db9c09d0>} 2025-05-07T20:32:50.1527047Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.1528137Z context = 2025-05-07T20:32:50.1528431Z 2025-05-07T20:32:50.1528611Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.1529140Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.1529630Z module_map=module_map) 2025-05-07T20:32:50.1530007Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.1530379Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:50.1530653Z E ^ 2025-05-07T20:32:50.1531129Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.1531588Z 2025-05-07T20:32:50.1532020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.1532544Z 2025-05-07T20:32:50.1532659Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.1533080Z self=, 2025-05-07T20:32:50.1533492Z T=4096, 2025-05-07T20:32:50.1533690Z D=5120, 2025-05-07T20:32:50.1533886Z scale_ub=None, 2025-05-07T20:32:50.1534114Z contiguous=False, 2025-05-07T20:32:50.1534356Z compiled=False, 2025-05-07T20:32:50.1534565Z ) 2025-05-07T20:32:51.7231371Z self = 2025-05-07T20:32:51.7231951Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:51.7232250Z 2025-05-07T20:32:51.7232338Z @given( 2025-05-07T20:32:51.7232767Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.7233673Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.7234308Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.7234993Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.7235664Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.7236247Z ) 2025-05-07T20:32:51.7236974Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.7237885Z def test_silu_mul_quant( 2025-05-07T20:32:51.7238374Z self, 2025-05-07T20:32:51.7238783Z T: int, 2025-05-07T20:32:51.7239187Z D: int, 2025-05-07T20:32:51.7239627Z scale_ub: Optional[float], 2025-05-07T20:32:51.7240183Z contiguous: bool, 2025-05-07T20:32:51.7240671Z compiled: bool, 2025-05-07T20:32:51.7241124Z ) -> None: 2025-05-07T20:32:51.7241568Z torch.manual_seed(2025) 2025-05-07T20:32:51.7242060Z 2025-05-07T20:32:51.7242618Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.7243039Z 2025-05-07T20:32:51.7243253Z x_sign = torch.sign(x) 2025-05-07T20:32:51.7243552Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.7243875Z x = x_sign * x_clamp 2025-05-07T20:32:51.7244126Z x0 = x[:, :D] 2025-05-07T20:32:51.7244429Z x1 = x[:, D:] 2025-05-07T20:32:51.7244652Z 2025-05-07T20:32:51.7244850Z if contiguous: 2025-05-07T20:32:51.7245087Z x0 = x0.contiguous() 2025-05-07T20:32:51.7245358Z x1 = x1.contiguous() 2025-05-07T20:32:51.7245692Z 2025-05-07T20:32:51.7245895Z if scale_ub is not None: 2025-05-07T20:32:51.7246175Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.7246523Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.7246840Z ) 2025-05-07T20:32:51.7247037Z else: 2025-05-07T20:32:51.7247257Z scale_ub_tensor = None 2025-05-07T20:32:51.7247517Z 2025-05-07T20:32:51.7247761Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.7248089Z op = silu_mul_quant 2025-05-07T20:32:51.7248348Z if compiled: 
2025-05-07T20:32:51.7248683Z op = torch.compile(op) 2025-05-07T20:32:51.7248998Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.7249284Z 2025-05-07T20:32:51.7249485Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.7249660Z 2025-05-07T20:32:51.7249767Z moe/activation_test.py:117: 2025-05-07T20:32:51.7250075Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.7250417Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.7250706Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.7251421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.7252136Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.7252690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.7253392Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.7254084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.7254637Z kernel = self.compile( 2025-05-07T20:32:51.7255195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.7256105Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.7256516Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.7256745Z 2025-05-07T20:32:51.7256967Z self = 2025-05-07T20:32:51.7264440Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.7265910Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6db9c1a20>} 2025-05-07T20:32:51.7267294Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.7268349Z context = 2025-05-07T20:32:51.7268645Z 2025-05-07T20:32:51.7268824Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.7269352Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.7269831Z module_map=module_map) 2025-05-07T20:32:51.7270212Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.7270574Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.7270832Z E ^ 2025-05-07T20:32:51.7271381Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.7271843Z 2025-05-07T20:32:51.7272276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.7272798Z 2025-05-07T20:32:51.7272974Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.7273395Z self=, 2025-05-07T20:32:51.7273807Z T=4096, 2025-05-07T20:32:51.7274002Z D=7168, 2025-05-07T20:32:51.7274195Z scale_ub=None, 2025-05-07T20:32:51.7274424Z contiguous=False, 2025-05-07T20:32:51.7274663Z compiled=False, 2025-05-07T20:32:51.7274873Z ) 2025-05-07T20:32:51.7275209Z self = 2025-05-07T20:32:51.7275740Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:51.7276083Z 2025-05-07T20:32:51.7276170Z @given( 2025-05-07T20:32:51.7276402Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.7276728Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.7277048Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.7277387Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.7277721Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.7278019Z ) 2025-05-07T20:32:51.7278379Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.7278822Z def test_silu_mul_quant( 2025-05-07T20:32:51.7279073Z self, 2025-05-07T20:32:51.7279277Z T: int, 2025-05-07T20:32:51.7279476Z D: int, 2025-05-07T20:32:51.7279705Z scale_ub: Optional[float], 2025-05-07T20:32:51.7279992Z contiguous: bool, 2025-05-07T20:32:51.7280230Z compiled: bool, 2025-05-07T20:32:51.7280465Z ) -> None: 2025-05-07T20:32:51.7280689Z torch.manual_seed(2025) 2025-05-07T20:32:51.7280935Z 2025-05-07T20:32:51.7281221Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.7281569Z 2025-05-07T20:32:51.7281764Z x_sign = torch.sign(x) 2025-05-07T20:32:51.7282066Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.7282384Z x = x_sign * x_clamp 2025-05-07T20:32:51.7282630Z x0 = x[:, :D] 2025-05-07T20:32:51.7282846Z x1 = x[:, D:] 2025-05-07T20:32:51.7283060Z 2025-05-07T20:32:51.7283257Z if contiguous: 2025-05-07T20:32:51.7283487Z x0 = x0.contiguous() 2025-05-07T20:32:51.7283752Z x1 = x1.contiguous() 2025-05-07T20:32:51.7283997Z 2025-05-07T20:32:51.7284189Z if scale_ub is not None: 2025-05-07T20:32:51.7284522Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.7284869Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.7285177Z ) 2025-05-07T20:32:51.7285379Z else: 2025-05-07T20:32:51.7285586Z scale_ub_tensor = None 2025-05-07T20:32:51.7285858Z 2025-05-07T20:32:51.7286097Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.7286409Z op = silu_mul_quant 2025-05-07T20:32:51.7286660Z if compiled: 2025-05-07T20:32:51.7286914Z op = torch.compile(op) 2025-05-07T20:32:51.7287208Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.7287483Z 2025-05-07T20:32:51.7287679Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.7287846Z 2025-05-07T20:32:51.7287956Z moe/activation_test.py:117: 2025-05-07T20:32:51.7288249Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.7288582Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.7288876Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.7289575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.7290278Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.7290870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.7291568Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.7292274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.7292868Z kernel = self.compile( 2025-05-07T20:32:51.7293423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.7294091Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.7294499Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.7294734Z 2025-05-07T20:32:51.7294950Z self = 2025-05-07T20:32:51.7296094Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.7297487Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6db9c2560>} 2025-05-07T20:32:51.7298952Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.7299995Z context = 2025-05-07T20:32:51.7300297Z 2025-05-07T20:32:51.7300467Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.7300996Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.7301470Z module_map=module_map) 2025-05-07T20:32:51.7301844Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.7302209Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.7302469Z E ^ 2025-05-07T20:32:51.7302983Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.7303463Z 2025-05-07T20:32:51.7303882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.7304399Z 2025-05-07T20:32:51.7304510Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.7304970Z self=, 2025-05-07T20:32:51.7305375Z T=128, 2025-05-07T20:32:51.7305566Z D=7168, 2025-05-07T20:32:51.7305755Z scale_ub=None, 2025-05-07T20:32:51.7305981Z contiguous=False, 2025-05-07T20:32:51.7306208Z compiled=True, 2025-05-07T20:32:51.7306408Z ) 2025-05-07T20:32:51.7922216Z self = 2025-05-07T20:32:51.7922784Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:51.7923163Z 2025-05-07T20:32:51.7923286Z @given( 2025-05-07T20:32:51.7923615Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.7924042Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.7924450Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.7924852Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.7925198Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.7925489Z ) 2025-05-07T20:32:51.7925861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.7926319Z def test_silu_mul_quant( 2025-05-07T20:32:51.7926572Z self, 2025-05-07T20:32:51.7926777Z T: int, 2025-05-07T20:32:51.7926991Z D: int, 2025-05-07T20:32:51.7927344Z scale_ub: Optional[float], 2025-05-07T20:32:51.7927636Z contiguous: bool, 2025-05-07T20:32:51.7927889Z compiled: bool, 2025-05-07T20:32:51.7928130Z ) -> None: 2025-05-07T20:32:51.7928355Z torch.manual_seed(2025) 2025-05-07T20:32:51.7928678Z 2025-05-07T20:32:51.7928970Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.7929326Z 2025-05-07T20:32:51.7929539Z x_sign = torch.sign(x) 2025-05-07T20:32:51.7929853Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.7930206Z x = x_sign * x_clamp 2025-05-07T20:32:51.7930456Z x0 = x[:, :D] 2025-05-07T20:32:51.7930692Z x1 = x[:, D:] 2025-05-07T20:32:51.7930922Z 2025-05-07T20:32:51.7931117Z if contiguous: 2025-05-07T20:32:51.7931368Z x0 = x0.contiguous() 2025-05-07T20:32:51.7931641Z x1 = x1.contiguous() 2025-05-07T20:32:51.7931959Z 2025-05-07T20:32:51.7932165Z if scale_ub is not None: 2025-05-07T20:32:51.7932461Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.7932868Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.7933184Z ) 2025-05-07T20:32:51.7933394Z else: 2025-05-07T20:32:51.7933618Z scale_ub_tensor = None 2025-05-07T20:32:51.7933875Z 2025-05-07T20:32:51.7934121Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.7934449Z op = silu_mul_quant 2025-05-07T20:32:51.7934705Z if compiled: 2025-05-07T20:32:51.7934968Z op = torch.compile(op) 2025-05-07T20:32:51.7935279Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.7935562Z 2025-05-07T20:32:51.7935773Z y_fp8, y_scale = fn() 2025-05-07T20:32:51.7936074Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:51.7936373Z 2025-05-07T20:32:51.7936632Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.7936984Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:51.7937289Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:51.7937611Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:51.7937981Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:51.7938459Z 2025-05-07T20:32:51.7938671Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:51.7938879Z 2025-05-07T20:32:51.7938986Z moe/activation_test.py:126: 2025-05-07T20:32:51.7939295Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.7939639Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:51.7940059Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:51.7940877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:51.7941657Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:51.7942216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.7942923Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.7943637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:51.7944379Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:51.7945146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:51.7945916Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:51.7946667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:51.7947323Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:51.7947994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:51.7948531Z fn() 2025-05-07T20:32:51.7949060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:51.7949717Z self.fn.run( 2025-05-07T20:32:51.7950200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.7950752Z kernel = self.compile( 2025-05-07T20:32:51.7951304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.7951978Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.7952397Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.7952709Z 2025-05-07T20:32:51.7952933Z self = 2025-05-07T20:32:51.7954038Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.7955442Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd6db9ee3b0>} 2025-05-07T20:32:51.7957244Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.7958300Z context = 2025-05-07T20:32:51.7958595Z 2025-05-07T20:32:51.7958780Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.7959316Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.7959803Z module_map=module_map) 2025-05-07T20:32:51.7960183Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.7960557Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:51.7960830Z E ^ 2025-05-07T20:32:51.7961309Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.7961766Z 2025-05-07T20:32:51.7962200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.7962745Z 2025-05-07T20:32:51.7962977Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.7963405Z self=, 2025-05-07T20:32:51.7963816Z T=128, 2025-05-07T20:32:51.7964021Z D=7168, 2025-05-07T20:32:51.7964219Z scale_ub=None, 2025-05-07T20:32:51.7964450Z contiguous=False, 2025-05-07T20:32:51.7964690Z compiled=False, 2025-05-07T20:32:51.7964904Z ) 2025-05-07T20:32:52.1552932Z self = 2025-05-07T20:32:52.1553461Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:52.1553763Z 2025-05-07T20:32:52.1553863Z @given( 2025-05-07T20:32:52.1554229Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.1554622Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.1555033Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.1555375Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.1555947Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.1556245Z ) 2025-05-07T20:32:52.1556609Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.1557059Z def test_silu_mul_quant( 2025-05-07T20:32:52.1557310Z self, 2025-05-07T20:32:52.1557634Z T: int, 2025-05-07T20:32:52.1557835Z D: int, 2025-05-07T20:32:52.1558062Z scale_ub: Optional[float], 2025-05-07T20:32:52.1558346Z contiguous: bool, 2025-05-07T20:32:52.1558587Z compiled: bool, 2025-05-07T20:32:52.1558926Z ) -> None: 2025-05-07T20:32:52.1559145Z torch.manual_seed(2025) 2025-05-07T20:32:52.1559397Z 2025-05-07T20:32:52.1559679Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.1560022Z 2025-05-07T20:32:52.1560223Z x_sign = torch.sign(x) 2025-05-07T20:32:52.1560524Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.1560834Z x = x_sign * x_clamp 2025-05-07T20:32:52.1561084Z x0 = x[:, :D] 2025-05-07T20:32:52.1561314Z x1 = x[:, D:] 2025-05-07T20:32:52.1561521Z 2025-05-07T20:32:52.1561786Z if contiguous: 2025-05-07T20:32:52.1562029Z x0 = x0.contiguous() 2025-05-07T20:32:52.1562291Z x1 = x1.contiguous() 2025-05-07T20:32:52.1562544Z 2025-05-07T20:32:52.1562782Z if scale_ub is not None: 2025-05-07T20:32:52.1563075Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.1563422Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.1563745Z ) 2025-05-07T20:32:52.1563939Z else: 2025-05-07T20:32:52.1564159Z scale_ub_tensor = None 2025-05-07T20:32:52.1564420Z 2025-05-07T20:32:52.1564665Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.1564980Z op = silu_mul_quant 2025-05-07T20:32:52.1565238Z if compiled: 
2025-05-07T20:32:52.1565493Z op = torch.compile(op) 2025-05-07T20:32:52.1565797Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.1566081Z 2025-05-07T20:32:52.1566281Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.1566455Z 2025-05-07T20:32:52.1566557Z moe/activation_test.py:117: 2025-05-07T20:32:52.1566864Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.1567197Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.1567480Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.1568191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.1568896Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.1569443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.1570131Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.1570877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.1571429Z kernel = self.compile( 2025-05-07T20:32:52.1571983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.1572647Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.1573051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.1573280Z 2025-05-07T20:32:52.1573500Z self = 2025-05-07T20:32:52.1574601Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.1576010Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6dba2e830>} 2025-05-07T20:32:52.1577422Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.1578595Z context = 2025-05-07T20:32:52.1578888Z 2025-05-07T20:32:52.1579235Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.1579815Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.1580299Z module_map=module_map) 2025-05-07T20:32:52.1580670Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.1581034Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.1581296Z E ^ 2025-05-07T20:32:52.1581776Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.1582234Z 2025-05-07T20:32:52.1582663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.1583229Z 2025-05-07T20:32:52.1583341Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.1583770Z self=, 2025-05-07T20:32:52.1584182Z T=4096, 2025-05-07T20:32:52.1584380Z D=5120, 2025-05-07T20:32:52.1584572Z scale_ub=1200.0, 2025-05-07T20:32:52.1584805Z contiguous=True, 2025-05-07T20:32:52.1585035Z compiled=False, 2025-05-07T20:32:52.1585240Z ) 2025-05-07T20:32:52.1585578Z self = 2025-05-07T20:32:52.1586083Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:52.1586361Z 2025-05-07T20:32:52.1586441Z @given( 2025-05-07T20:32:52.1586684Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.1587010Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.1587323Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.1587666Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.1588005Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.1588301Z ) 2025-05-07T20:32:52.1588652Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.1589107Z def test_silu_mul_quant( 2025-05-07T20:32:52.1589359Z self, 2025-05-07T20:32:52.1589554Z T: int, 2025-05-07T20:32:52.1589760Z D: int, 2025-05-07T20:32:52.1589986Z scale_ub: Optional[float], 2025-05-07T20:32:52.1590260Z contiguous: bool, 2025-05-07T20:32:52.1590507Z compiled: bool, 2025-05-07T20:32:52.1590735Z ) -> None: 2025-05-07T20:32:52.1590955Z torch.manual_seed(2025) 2025-05-07T20:32:52.1591257Z 2025-05-07T20:32:52.1591545Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.1591889Z 2025-05-07T20:32:52.1592094Z x_sign = torch.sign(x) 2025-05-07T20:32:52.1592395Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.1592749Z x = x_sign * x_clamp 2025-05-07T20:32:52.1593003Z x0 = x[:, :D] 2025-05-07T20:32:52.1593222Z x1 = x[:, D:] 2025-05-07T20:32:52.1593434Z 2025-05-07T20:32:52.1593619Z if contiguous: 2025-05-07T20:32:52.1593859Z x0 = x0.contiguous() 2025-05-07T20:32:52.1594121Z x1 = x1.contiguous() 2025-05-07T20:32:52.1594362Z 2025-05-07T20:32:52.1594564Z if scale_ub is not None: 2025-05-07T20:32:52.1594843Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.1595178Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.1595490Z ) 2025-05-07T20:32:52.1595691Z else: 2025-05-07T20:32:52.1595905Z scale_ub_tensor = None 2025-05-07T20:32:52.1596160Z 2025-05-07T20:32:52.1596399Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.1596716Z op = silu_mul_quant 2025-05-07T20:32:52.1596967Z if compiled: 2025-05-07T20:32:52.1597268Z op = torch.compile(op) 2025-05-07T20:32:52.1597568Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.1597847Z 2025-05-07T20:32:52.1598048Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.1598259Z 2025-05-07T20:32:52.1598369Z moe/activation_test.py:117: 2025-05-07T20:32:52.1598660Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.1598992Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.1599280Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.1599979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.1600689Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.1601235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.1601975Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.1602647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.1603242Z kernel = self.compile( 2025-05-07T20:32:52.1603797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.1604459Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.1604858Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.1605094Z 2025-05-07T20:32:52.1605307Z self = 2025-05-07T20:32:52.1606407Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.1607804Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6dba2db40>} 2025-05-07T20:32:52.1609171Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.1610216Z context = 2025-05-07T20:32:52.1610507Z 2025-05-07T20:32:52.1610681Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.1611254Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.1611726Z module_map=module_map) 2025-05-07T20:32:52.1612096Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.1612459Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.1612717Z E ^ 2025-05-07T20:32:52.1613192Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.1613649Z 2025-05-07T20:32:52.1614075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.1614594Z 2025-05-07T20:32:52.1614706Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.1615121Z self=, 2025-05-07T20:32:52.1615527Z T=1, 2025-05-07T20:32:52.1615719Z D=5120, 2025-05-07T20:32:52.1615910Z scale_ub=None, 2025-05-07T20:32:52.1616129Z contiguous=True, 2025-05-07T20:32:52.1616358Z compiled=True, 2025-05-07T20:32:52.1616561Z ) 2025-05-07T20:32:52.7392014Z self = 2025-05-07T20:32:52.7392662Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:52.7392974Z 2025-05-07T20:32:52.7393208Z @given( 2025-05-07T20:32:52.7393450Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.7400489Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.7400821Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.7401272Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.7401782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.7402072Z ) 2025-05-07T20:32:52.7402433Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.7402873Z def test_silu_mul_quant( 2025-05-07T20:32:52.7403159Z self, 2025-05-07T20:32:52.7403376Z T: int, 2025-05-07T20:32:52.7403580Z D: int, 2025-05-07T20:32:52.7403803Z scale_ub: Optional[float], 2025-05-07T20:32:52.7404077Z contiguous: bool, 2025-05-07T20:32:52.7404390Z compiled: bool, 2025-05-07T20:32:52.7404625Z ) -> None: 2025-05-07T20:32:52.7404854Z torch.manual_seed(2025) 2025-05-07T20:32:52.7405094Z 2025-05-07T20:32:52.7405381Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.7405733Z 2025-05-07T20:32:52.7405926Z x_sign = torch.sign(x) 2025-05-07T20:32:52.7406229Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.7406544Z x = x_sign * x_clamp 2025-05-07T20:32:52.7406782Z x0 = x[:, :D] 2025-05-07T20:32:52.7407008Z x1 = x[:, D:] 2025-05-07T20:32:52.7407220Z 2025-05-07T20:32:52.7407412Z if contiguous: 2025-05-07T20:32:52.7407643Z x0 = x0.contiguous() 2025-05-07T20:32:52.7407906Z x1 = x1.contiguous() 2025-05-07T20:32:52.7408152Z 2025-05-07T20:32:52.7408348Z if scale_ub is not None: 2025-05-07T20:32:52.7408634Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.7408979Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.7409287Z ) 2025-05-07T20:32:52.7409492Z else: 2025-05-07T20:32:52.7409708Z scale_ub_tensor = None 2025-05-07T20:32:52.7409956Z 2025-05-07T20:32:52.7410199Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.7410522Z op = silu_mul_quant 2025-05-07T20:32:52.7410775Z if compiled: 2025-05-07T20:32:52.7411030Z op = torch.compile(op) 2025-05-07T20:32:52.7411333Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.7411606Z 2025-05-07T20:32:52.7411806Z y_fp8, y_scale = fn() 2025-05-07T20:32:52.7412097Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:52.7412391Z 2025-05-07T20:32:52.7412705Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.7413046Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:52.7413350Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:52.7413670Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:52.7414041Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.7414352Z 2025-05-07T20:32:52.7414553Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:52.7414757Z 2025-05-07T20:32:52.7414859Z moe/activation_test.py:126: 2025-05-07T20:32:52.7415159Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.7415494Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:52.7415822Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.7416628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:52.7417397Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:52.7417940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.7418746Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.7419498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:52.7420226Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:52.7421027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:52.7421781Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:52.7422509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:52.7423157Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:52.7423768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:52.7424330Z fn() 2025-05-07T20:32:52.7424841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:52.7425426Z self.fn.run( 2025-05-07T20:32:52.7425903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.7426442Z kernel = self.compile( 2025-05-07T20:32:52.7426994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.7427662Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.7428062Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.7428287Z 2025-05-07T20:32:52.7428499Z self = 2025-05-07T20:32:52.7429606Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.7431019Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd6dba2f250>} 2025-05-07T20:32:52.7432389Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.7433432Z context = 2025-05-07T20:32:52.7433722Z 2025-05-07T20:32:52.7433891Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.7434469Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.7434946Z module_map=module_map) 2025-05-07T20:32:52.7435313Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.7435679Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:52.7435948Z E ^ 2025-05-07T20:32:52.7436420Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.7436879Z 2025-05-07T20:32:52.7437301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.7437826Z 2025-05-07T20:32:52.7437933Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.7438354Z self=, 2025-05-07T20:32:52.7438763Z T=2048, 2025-05-07T20:32:52.7438949Z D=5120, 2025-05-07T20:32:52.7439152Z scale_ub=None, 2025-05-07T20:32:52.7439371Z contiguous=True, 2025-05-07T20:32:52.7439597Z compiled=True, 2025-05-07T20:32:52.7439812Z ) 2025-05-07T20:32:53.2806458Z self = 2025-05-07T20:32:53.2807148Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:53.2807434Z 2025-05-07T20:32:53.2807520Z @given( 2025-05-07T20:32:53.2807766Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.2808147Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.2808473Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.2808823Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.2809163Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.2809461Z ) 2025-05-07T20:32:53.2809832Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.2810296Z def test_silu_mul_quant( 2025-05-07T20:32:53.2810547Z self, 2025-05-07T20:32:53.2810758Z T: int, 2025-05-07T20:32:53.2810971Z D: int, 2025-05-07T20:32:53.2811265Z scale_ub: Optional[float], 2025-05-07T20:32:53.2811554Z contiguous: bool, 2025-05-07T20:32:53.2811811Z compiled: bool, 2025-05-07T20:32:53.2812044Z ) -> None: 2025-05-07T20:32:53.2812279Z torch.manual_seed(2025) 2025-05-07T20:32:53.2812536Z 2025-05-07T20:32:53.2812818Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.2813179Z 2025-05-07T20:32:53.2813386Z x_sign = torch.sign(x) 2025-05-07T20:32:53.2813687Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.2814013Z x = x_sign * x_clamp 2025-05-07T20:32:53.2814265Z x0 = x[:, :D] 2025-05-07T20:32:53.2814489Z x1 = x[:, D:] 2025-05-07T20:32:53.2814708Z 2025-05-07T20:32:53.2814908Z if contiguous: 2025-05-07T20:32:53.2815152Z x0 = x0.contiguous() 2025-05-07T20:32:53.2815426Z x1 = x1.contiguous() 2025-05-07T20:32:53.2815680Z 2025-05-07T20:32:53.2815881Z if scale_ub is not None: 2025-05-07T20:32:53.2816175Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.2816530Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.2816853Z ) 2025-05-07T20:32:53.2817054Z else: 2025-05-07T20:32:53.2817277Z scale_ub_tensor = None 2025-05-07T20:32:53.2817544Z 2025-05-07T20:32:53.2817788Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.2818214Z op = silu_mul_quant 2025-05-07T20:32:53.2818477Z if compiled: 
2025-05-07T20:32:53.2818734Z op = torch.compile(op) 2025-05-07T20:32:53.2819052Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.2819342Z 2025-05-07T20:32:53.2819544Z y_fp8, y_scale = fn() 2025-05-07T20:32:53.2819917Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:53.2820222Z 2025-05-07T20:32:53.2820471Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.2820826Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:53.2821137Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:53.2821468Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:53.2821839Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:53.2822167Z 2025-05-07T20:32:53.2822391Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:53.2822594Z 2025-05-07T20:32:53.2822701Z moe/activation_test.py:126: 2025-05-07T20:32:53.2823015Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.2823363Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:53.2823703Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:53.2824520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:53.2825298Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:53.2825866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.2826607Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.2827315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:53.2828098Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:53.2828865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:53.2829625Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:53.2830374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:53.2831029Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:53.2831737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:53.2832263Z fn() 2025-05-07T20:32:53.2832783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:53.2833378Z self.fn.run( 2025-05-07T20:32:53.2833858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.2834402Z kernel = self.compile( 2025-05-07T20:32:53.2834958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.2835630Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.2836032Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.2836267Z 2025-05-07T20:32:53.2836482Z self = 2025-05-07T20:32:53.2837599Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:53.2839005Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6db5d3760>} 2025-05-07T20:32:53.2840377Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.2841429Z context = 2025-05-07T20:32:53.2841837Z 2025-05-07T20:32:53.2842012Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.2842548Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.2843075Z module_map=module_map) 2025-05-07T20:32:53.2843458Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.2843827Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:53.2844101Z E ^ 2025-05-07T20:32:53.2844586Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.2845050Z 2025-05-07T20:32:53.2845473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.2845992Z 2025-05-07T20:32:53.2846108Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.2846531Z self=, 2025-05-07T20:32:53.2846946Z T=128, 2025-05-07T20:32:53.2847144Z D=5120, 2025-05-07T20:32:53.2847348Z scale_ub=None, 2025-05-07T20:32:53.2847574Z contiguous=True, 2025-05-07T20:32:53.2847808Z compiled=True, 2025-05-07T20:32:53.2848022Z ) 2025-05-07T20:32:54.1735090Z self = 2025-05-07T20:32:54.1735894Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.1736277Z 2025-05-07T20:32:54.1736489Z @given( 2025-05-07T20:32:54.1736730Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.1737054Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.1737373Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.1737711Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.1738139Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.1738436Z ) 2025-05-07T20:32:54.1738802Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.1739252Z def test_silu_mul_quant( 2025-05-07T20:32:54.1739503Z self, 2025-05-07T20:32:54.1739790Z T: int, 2025-05-07T20:32:54.1739990Z D: int, 2025-05-07T20:32:54.1740222Z scale_ub: Optional[float], 2025-05-07T20:32:54.1740502Z contiguous: bool, 2025-05-07T20:32:54.1740745Z compiled: bool, 2025-05-07T20:32:54.1740978Z ) -> None: 2025-05-07T20:32:54.1741202Z torch.manual_seed(2025) 2025-05-07T20:32:54.1741449Z 2025-05-07T20:32:54.1741731Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.1742087Z 2025-05-07T20:32:54.1742283Z x_sign = torch.sign(x) 2025-05-07T20:32:54.1742585Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.1742906Z x = x_sign * x_clamp 2025-05-07T20:32:54.1743148Z x0 = x[:, :D] 2025-05-07T20:32:54.1743372Z x1 = x[:, D:] 2025-05-07T20:32:54.1743592Z 2025-05-07T20:32:54.1743781Z if contiguous: 2025-05-07T20:32:54.1744018Z x0 = x0.contiguous() 2025-05-07T20:32:54.1744288Z x1 = x1.contiguous() 2025-05-07T20:32:54.1744535Z 2025-05-07T20:32:54.1744730Z if scale_ub is not None: 2025-05-07T20:32:54.1745015Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.1745362Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.1745669Z ) 2025-05-07T20:32:54.1745876Z else: 2025-05-07T20:32:54.1746095Z scale_ub_tensor = None 2025-05-07T20:32:54.1746348Z 2025-05-07T20:32:54.1746591Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
[The following Hypothesis examples rerun the identical test body and raise the identical CompilationError; only the sampled parameters and the failure site vary, so the repeated tracebacks are elided.]
2025-05-07T20:32:53.2846108Z Trying example: test_silu_mul_quant( self=<...>, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True ) -- fails in ref_fn() at moe/activation_test.py:126, same CompilationError in _kernel_quantize_fp8_row
2025-05-07T20:32:54.1774856Z Trying example: test_silu_mul_quant( self=<...>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True ) -- fails in ref_fn() at moe/activation_test.py:126, same CompilationError in _kernel_quantize_fp8_row
2025-05-07T20:32:54.9234482Z Trying example: test_silu_mul_quant( self=<...>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True ) -- torch._dynamo hits its recompile limit (warnings below), then fails in ref_fn() with the same CompilationError
2025-05-07T20:32:54.9624028Z W0507 20:32:54.960000 88308 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:54.9626802Z W0507 20:32:54.960000 88308 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:54.9629307Z W0507 20:32:54.960000 88308 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:54.9631151Z W0507 20:32:54.960000 88308 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:54.9633431Z W0507 20:32:54.960000 88308 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
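Separate from the fp8 failure, the warning above shows torch.compile giving up on silu_mul_quant after 8 recompiles: each sampled (T, D, contiguous) combination changes the guarded shapes and strides of x0 (the "stride mismatch ... expected 5120, actual 10240" is a contiguous copy versus a strided view of the [T, 2*D] buffer), so by the eighth variant dynamo falls back to eager. Besides rerunning with TORCH_LOGS="recompiles" as the warning suggests, the knobs it points at look roughly like the sketch below; the values are illustrative, not what the test suite actually sets.

    # Sketch of the mitigations suggested by the recompile warning above.
    import torch

    # Raise the per-function recompile budget so all sampled (T, D, contiguous)
    # variants can compile (the warning reports the default limit of 8):
    torch._dynamo.config.recompile_limit = 32

    # Or mark the token dimension as dynamic up front, so a change in T does
    # not install a fresh guard set on every Hypothesis example:
    x = torch.randn(128, 2 * 5120, device="cuda", dtype=torch.bfloat16)
    torch._dynamo.mark_dynamic(x, 0)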
[... test body and traceback identical to the first example above; ref_fn() raises the same CompilationError in _kernel_quantize_fp8_row ...]
2025-05-07T20:32:55.0697741Z Trying example: test_silu_mul_quant( self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True )
[... identical test body omitted; this time the failure is inside fn() itself, i.e. in the fused _fbgemm_silu_mul_quant kernel rather than in the eager reference ...]
2025-05-07T20:32:55.2157285Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:55.2157452Z 
2025-05-07T20:32:55.2157556Z moe/activation_test.py:117: 
2025-05-07T20:32:55.2157848Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:55.2158269Z moe/activation_test.py:115: in fn
2025-05-07T20:32:55.2158558Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:55.2159124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:55.2159692Z     return fn(*args, **kwargs)
2025-05-07T20:32:55.2160366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:55.2161073Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:55.2161687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:55.2162389Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:55.2163066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:55.2163613Z     kernel = self.compile(
2025-05-07T20:32:55.2164162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:55.2164833Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:55.2165235Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:55.2165462Z 
2025-05-07T20:32:55.2165683Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:55.2166786Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:55.2168195Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fd6da692b90>}
2025-05-07T20:32:55.2169573Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:55.2170620Z context = <...>
2025-05-07T20:32:55.2170912Z 
2025-05-07T20:32:55.2171086Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:55.2171681Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:55.2172161Z                            module_map=module_map)
2025-05-07T20:32:55.2172534Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:55.2172892Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:55.2173156Z E       ^
2025-05-07T20:32:55.2173678Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:55.2174139Z 
2025-05-07T20:32:55.2174567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
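For reference, both failing kernels attempt the same operation the test checks: y = x0 * sigmoid(x0) * x1 followed by row-wise fp8 quantization. A pure-PyTorch sketch of that quantization step, written to match the dequantization the test performs (y_fp8.to(torch.float32) * y_scale[:, None]), is below; the per-row scale formula and the clamping are assumptions about triton_quantize_fp8_row's semantics, not its actual implementation.

    # Pure-PyTorch sketch of row-wise fp8 quantization as exercised by
    # ref_fn(); the scale formula is an assumption, not FBGEMM's kernel.
    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for fp8e4nv/e4m3fn

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            # Clamp the per-row max so outliers cannot blow up the scale.
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max.clamp(min=1e-12) / FP8_MAX  # dequantization scale
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

On this runner even the sketch's final .to(torch.float8_e4m3fn) line only works because the conversion happens in PyTorch rather than in a Triton kernel; the hardware limitation applies to Triton's fp8e4nv codegen, not to storing the dtype.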
y_scale_ref = ref_fn() 2025-05-07T20:32:55.2869826Z 2025-05-07T20:32:55.2869930Z moe/activation_test.py:126: 2025-05-07T20:32:55.2870231Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.2870571Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:55.2870904Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.2871709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:55.2872482Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:55.2873040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.2873854Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.2874559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:55.2875357Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.2876123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:55.2876892Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.2877642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:55.2878301Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:55.2878960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:55.2879495Z fn() 2025-05-07T20:32:55.2880019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:55.2880608Z self.fn.run( 2025-05-07T20:32:55.2881094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.2881638Z kernel = self.compile( 2025-05-07T20:32:55.2882195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.2882859Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.2883265Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.2883494Z 2025-05-07T20:32:55.2883715Z self = 2025-05-07T20:32:55.2884883Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.2886286Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd6daf79b40>} 2025-05-07T20:32:55.2887666Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.2888718Z context = 2025-05-07T20:32:55.2889008Z 2025-05-07T20:32:55.2889228Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.2889760Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.2890251Z module_map=module_map) 2025-05-07T20:32:55.2890628Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.2890994Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:55.2891263Z E ^ 2025-05-07T20:32:55.2891740Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.2892204Z 2025-05-07T20:32:55.2892637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.2893160Z 2025-05-07T20:32:55.2893267Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.2893692Z self=, 2025-05-07T20:32:55.2894106Z T=1, 2025-05-07T20:32:55.2894298Z D=5120, 2025-05-07T20:32:55.2894491Z scale_ub=None, 2025-05-07T20:32:55.2894713Z contiguous=True, 2025-05-07T20:32:55.2894945Z compiled=False, 2025-05-07T20:32:55.2895153Z ) 2025-05-07T20:32:55.6102863Z self = 2025-05-07T20:32:55.6103623Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:55.6104038Z 2025-05-07T20:32:55.6104151Z @given( 2025-05-07T20:32:55.6104502Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.6104820Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.6105140Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.6105486Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.6105818Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.6106111Z ) 2025-05-07T20:32:55.6106471Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.6106923Z def test_silu_mul_quant( 2025-05-07T20:32:55.6107169Z self, 2025-05-07T20:32:55.6107443Z T: int, 2025-05-07T20:32:55.6107646Z D: int, 2025-05-07T20:32:55.6107869Z scale_ub: Optional[float], 2025-05-07T20:32:55.6108155Z contiguous: bool, 2025-05-07T20:32:55.6108402Z compiled: bool, 2025-05-07T20:32:55.6108631Z ) -> None: 2025-05-07T20:32:55.6108856Z torch.manual_seed(2025) 2025-05-07T20:32:55.6109108Z 2025-05-07T20:32:55.6109386Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.6109738Z 2025-05-07T20:32:55.6109940Z x_sign = torch.sign(x) 2025-05-07T20:32:55.6110233Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.6110549Z x = x_sign * x_clamp 2025-05-07T20:32:55.6110800Z x0 = x[:, :D] 2025-05-07T20:32:55.6111018Z x1 = x[:, D:] 2025-05-07T20:32:55.6111233Z 2025-05-07T20:32:55.6111428Z if contiguous: 2025-05-07T20:32:55.6111670Z x0 = x0.contiguous() 2025-05-07T20:32:55.6111929Z x1 = x1.contiguous() 2025-05-07T20:32:55.6112180Z 2025-05-07T20:32:55.6112382Z if scale_ub is not None: 2025-05-07T20:32:55.6112658Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.6113007Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.6113320Z ) 2025-05-07T20:32:55.6113515Z else: 2025-05-07T20:32:55.6113744Z scale_ub_tensor = None 2025-05-07T20:32:55.6114001Z 2025-05-07T20:32:55.6114236Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.6114557Z op = silu_mul_quant 2025-05-07T20:32:55.6114815Z if compiled: 2025-05-07T20:32:55.6115066Z 
op = torch.compile(op) 2025-05-07T20:32:55.6115372Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.6115650Z 2025-05-07T20:32:55.6115918Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.6116093Z 2025-05-07T20:32:55.6116195Z moe/activation_test.py:117: 2025-05-07T20:32:55.6116493Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.6116830Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.6117119Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.6117827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.6118531Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.6119072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.6126001Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.6126730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.6127277Z kernel = self.compile( 2025-05-07T20:32:55.6127832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.6128501Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.6128976Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.6129204Z 2025-05-07T20:32:55.6129415Z self = 2025-05-07T20:32:55.6130550Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.6131946Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6daf79ea0>} 2025-05-07T20:32:55.6133360Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.6134439Z context = 2025-05-07T20:32:55.6134731Z 2025-05-07T20:32:55.6134898Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.6135421Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.6135895Z module_map=module_map) 2025-05-07T20:32:55.6136261Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.6136623Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.6136878Z E ^ 2025-05-07T20:32:55.6137350Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.6137807Z 2025-05-07T20:32:55.6138341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.6138864Z 2025-05-07T20:32:55.6138970Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.6139388Z self=, 2025-05-07T20:32:55.6139786Z T=128, 2025-05-07T20:32:55.6139980Z D=5120, 2025-05-07T20:32:55.6140175Z scale_ub=None, 2025-05-07T20:32:55.6140397Z contiguous=False, 2025-05-07T20:32:55.6140625Z compiled=True, 2025-05-07T20:32:55.6140828Z ) 2025-05-07T20:32:55.6141151Z self = 2025-05-07T20:32:55.6141644Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:55.6141917Z 2025-05-07T20:32:55.6141994Z @given( 2025-05-07T20:32:55.6142225Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.6142584Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.6142893Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.6143222Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.6143551Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.6143837Z ) 2025-05-07T20:32:55.6144197Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.6144642Z def test_silu_mul_quant( 2025-05-07T20:32:55.6144879Z self, 2025-05-07T20:32:55.6145073Z T: int, 2025-05-07T20:32:55.6145269Z D: int, 2025-05-07T20:32:55.6145483Z scale_ub: Optional[float], 2025-05-07T20:32:55.6145757Z contiguous: bool, 2025-05-07T20:32:55.6145997Z compiled: bool, 2025-05-07T20:32:55.6146219Z ) -> None: 2025-05-07T20:32:55.6146437Z torch.manual_seed(2025) 2025-05-07T20:32:55.6146678Z 2025-05-07T20:32:55.6146950Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.6147294Z 2025-05-07T20:32:55.6147490Z x_sign = torch.sign(x) 2025-05-07T20:32:55.6147777Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.6148089Z x = x_sign * x_clamp 2025-05-07T20:32:55.6148326Z x0 = x[:, :D] 2025-05-07T20:32:55.6148587Z x1 = x[:, D:] 2025-05-07T20:32:55.6148796Z 2025-05-07T20:32:55.6148986Z if contiguous: 2025-05-07T20:32:55.6149217Z x0 = x0.contiguous() 2025-05-07T20:32:55.6149468Z x1 = x1.contiguous() 2025-05-07T20:32:55.6149752Z 2025-05-07T20:32:55.6149937Z if scale_ub is not None: 2025-05-07T20:32:55.6150211Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.6150548Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.6150849Z ) 2025-05-07T20:32:55.6151040Z else: 2025-05-07T20:32:55.6151248Z scale_ub_tensor = None 2025-05-07T20:32:55.6151491Z 2025-05-07T20:32:55.6151725Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.6152039Z op = silu_mul_quant 2025-05-07T20:32:55.6152282Z if compiled: 2025-05-07T20:32:55.6152575Z op = torch.compile(op) 2025-05-07T20:32:55.6152873Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.6153144Z 2025-05-07T20:32:55.6153334Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.6153497Z 2025-05-07T20:32:55.6153598Z moe/activation_test.py:117: 2025-05-07T20:32:55.6153891Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.6154216Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.6154500Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.6155061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.6155862Z return fn(*args, **kwargs) 
2025-05-07T20:32:55.6156530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.6157225Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.6157763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.6158451Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.6159117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.6159655Z kernel = self.compile( 2025-05-07T20:32:55.6160196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.6160857Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.6161251Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.6161475Z 2025-05-07T20:32:55.6161770Z self = 2025-05-07T20:32:55.6162860Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.6164254Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6daf78dc0>} 2025-05-07T20:32:55.6165616Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.6166655Z context = 2025-05-07T20:32:55.6166941Z 2025-05-07T20:32:55.6167111Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.6167631Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.6168101Z module_map=module_map) 2025-05-07T20:32:55.6168467Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.6168879Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.6169140Z E ^ 2025-05-07T20:32:55.6169603Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.6170056Z 2025-05-07T20:32:55.6170547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.6171060Z 2025-05-07T20:32:55.6171163Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.6171577Z self=, 2025-05-07T20:32:55.6171978Z T=128, 2025-05-07T20:32:55.6172162Z D=7168, 2025-05-07T20:32:55.6172353Z scale_ub=1200.0, 2025-05-07T20:32:55.6172579Z contiguous=False, 2025-05-07T20:32:55.6172797Z compiled=False, 2025-05-07T20:32:55.6172996Z ) 2025-05-07T20:32:55.7425119Z self = 2025-05-07T20:32:55.7426006Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:55.7426411Z 2025-05-07T20:32:55.7426518Z @given( 2025-05-07T20:32:55.7426844Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.7427284Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.7427645Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.7427985Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.7428321Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.7428608Z ) 2025-05-07T20:32:55.7428964Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.7429414Z def test_silu_mul_quant( 2025-05-07T20:32:55.7429662Z self, 2025-05-07T20:32:55.7429857Z T: int, 2025-05-07T20:32:55.7430057Z D: int, 2025-05-07T20:32:55.7430282Z scale_ub: Optional[float], 2025-05-07T20:32:55.7430591Z contiguous: bool, 2025-05-07T20:32:55.7430836Z compiled: bool, 2025-05-07T20:32:55.7431064Z ) -> None: 2025-05-07T20:32:55.7431290Z torch.manual_seed(2025) 2025-05-07T20:32:55.7431538Z 2025-05-07T20:32:55.7431812Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.7432163Z 2025-05-07T20:32:55.7432365Z x_sign = torch.sign(x) 2025-05-07T20:32:55.7432658Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.7432974Z x = x_sign * x_clamp 2025-05-07T20:32:55.7433222Z x0 = x[:, :D] 2025-05-07T20:32:55.7433462Z x1 = x[:, D:] 2025-05-07T20:32:55.7433707Z 2025-05-07T20:32:55.7433900Z if contiguous: 2025-05-07T20:32:55.7434137Z x0 = x0.contiguous() 2025-05-07T20:32:55.7434481Z x1 = x1.contiguous() 2025-05-07T20:32:55.7434732Z 2025-05-07T20:32:55.7434926Z if scale_ub is not None: 2025-05-07T20:32:55.7435207Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.7435551Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.7435863Z ) 2025-05-07T20:32:55.7436061Z else: 2025-05-07T20:32:55.7436275Z scale_ub_tensor = None 2025-05-07T20:32:55.7436530Z 2025-05-07T20:32:55.7436763Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.7437090Z op = silu_mul_quant 2025-05-07T20:32:55.7437350Z if compiled: 2025-05-07T20:32:55.7437597Z op = torch.compile(op) 2025-05-07T20:32:55.7437900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.7438177Z 2025-05-07T20:32:55.7438371Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.7438547Z 2025-05-07T20:32:55.7438649Z moe/activation_test.py:117: 2025-05-07T20:32:55.7438953Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.7439280Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.7439567Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.7440340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.7441041Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.7441585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.7442336Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.7443010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.7443550Z kernel = self.compile( 2025-05-07T20:32:55.7444151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.7444821Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.7445221Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.7445521Z 2025-05-07T20:32:55.7445732Z self = 2025-05-07T20:32:55.7446834Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.7448234Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6daba3f40>} 2025-05-07T20:32:55.7449601Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.7450647Z context = 2025-05-07T20:32:55.7450940Z 2025-05-07T20:32:55.7451107Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.7451642Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.7452117Z module_map=module_map) 2025-05-07T20:32:55.7452488Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.7452844Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.7453108Z E ^ 2025-05-07T20:32:55.7453581Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.7454037Z 2025-05-07T20:32:55.7454459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.7455030Z 2025-05-07T20:32:55.7455138Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.7455738Z self=, 2025-05-07T20:32:55.7456155Z T=128, 2025-05-07T20:32:55.7456346Z D=5120, 2025-05-07T20:32:55.7456550Z scale_ub=None, 2025-05-07T20:32:55.7456771Z contiguous=False, 2025-05-07T20:32:55.7457002Z compiled=False, 2025-05-07T20:32:55.7457218Z ) 2025-05-07T20:32:55.7457542Z self = 2025-05-07T20:32:55.7458133Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:55.7458406Z 2025-05-07T20:32:55.7458484Z @given( 2025-05-07T20:32:55.7458716Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.7459038Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.7459346Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.7459684Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.7460021Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.7460306Z ) 2025-05-07T20:32:55.7460663Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.7461183Z def test_silu_mul_quant( 2025-05-07T20:32:55.7461436Z self, 2025-05-07T20:32:55.7461635Z T: int, 2025-05-07T20:32:55.7461835Z D: int, 2025-05-07T20:32:55.7462057Z scale_ub: Optional[float], 2025-05-07T20:32:55.7462392Z contiguous: bool, 2025-05-07T20:32:55.7462637Z compiled: bool, 2025-05-07T20:32:55.7462862Z ) -> None: 2025-05-07T20:32:55.7463077Z torch.manual_seed(2025) 2025-05-07T20:32:55.7463319Z 2025-05-07T20:32:55.7463599Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.7463939Z 2025-05-07T20:32:55.7464138Z x_sign = torch.sign(x) 2025-05-07T20:32:55.7464435Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.7464744Z x = x_sign * x_clamp 2025-05-07T20:32:55.7464986Z x0 = x[:, :D] 2025-05-07T20:32:55.7465205Z x1 = x[:, D:] 2025-05-07T20:32:55.7465482Z 2025-05-07T20:32:55.7465676Z if contiguous: 2025-05-07T20:32:55.7465916Z x0 = x0.contiguous() 2025-05-07T20:32:55.7466177Z x1 = x1.contiguous() 2025-05-07T20:32:55.7466417Z 2025-05-07T20:32:55.7466611Z if scale_ub is not None: 2025-05-07T20:32:55.7466888Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.7467229Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.7467541Z ) 2025-05-07T20:32:55.7467739Z else: 2025-05-07T20:32:55.7467948Z scale_ub_tensor = None 2025-05-07T20:32:55.7468200Z 2025-05-07T20:32:55.7468439Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.7468750Z op = silu_mul_quant 2025-05-07T20:32:55.7469005Z if compiled: 2025-05-07T20:32:55.7469255Z op = torch.compile(op) 2025-05-07T20:32:55.7469554Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.7469832Z 2025-05-07T20:32:55.7470032Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.7470200Z 2025-05-07T20:32:55.7470302Z moe/activation_test.py:117: 2025-05-07T20:32:55.7470602Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.7470937Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.7471231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.7471927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.7472626Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:55.7473175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:55.7473927Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:55.7474602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:55.7475143Z     kernel = self.compile(
2025-05-07T20:32:55.7475692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:55.7476351Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:55.7476747Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:55.7477188Z self = <...>
2025-05-07T20:32:55.7478274Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:55.7479662Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fd6daba12d0>}
2025-05-07T20:32:55.7481072Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:55.7482112Z context = <...>
2025-05-07T20:32:55.7482613Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:55.7483136Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:55.7483609Z                            module_map=module_map)
2025-05-07T20:32:55.7483977Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:55.7484332Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:55.7484592Z E   ^
2025-05-07T20:32:55.7485057Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:55.7485986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:55.7486610Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:55.7487023Z     self=<...>,
2025-05-07T20:32:55.7487428Z     T=128,
2025-05-07T20:32:55.7487619Z     D=5120,
2025-05-07T20:32:55.7487809Z     scale_ub=1200.0,
2025-05-07T20:32:55.7488035Z     contiguous=True,
2025-05-07T20:32:55.7488260Z     compiled=False,
2025-05-07T20:32:55.7488464Z )
2025-05-07T20:32:55.9417323Z self = <...>
2025-05-07T20:32:55.9418237Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:32:55.9418736Z     @given(
2025-05-07T20:32:55.9419057Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:55.9419474Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:55.9419791Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:55.9420135Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:55.9420476Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:55.9420763Z     )
2025-05-07T20:32:55.9421128Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:55.9421587Z     def test_silu_mul_quant(
2025-05-07T20:32:55.9421834Z         self,
2025-05-07T20:32:55.9422032Z         T: int,
2025-05-07T20:32:55.9422237Z         D: int,
2025-05-07T20:32:55.9422464Z         scale_ub: Optional[float],
2025-05-07T20:32:55.9422739Z         contiguous: bool,
2025-05-07T20:32:55.9422985Z         compiled: bool,
2025-05-07T20:32:55.9423216Z     ) -> None:
2025-05-07T20:32:55.9423552Z         torch.manual_seed(2025)
2025-05-07T20:32:55.9424138Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.9424690Z         x_sign = torch.sign(x)
2025-05-07T20:32:55.9424996Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:55.9425312Z         x = x_sign * x_clamp
2025-05-07T20:32:55.9425556Z         x0 = x[:, :D]
2025-05-07T20:32:55.9425780Z         x1 = x[:, D:]
2025-05-07T20:32:55.9426186Z         if contiguous:
2025-05-07T20:32:55.9426427Z             x0 = x0.contiguous()
2025-05-07T20:32:55.9426685Z             x1 = x1.contiguous()
2025-05-07T20:32:55.9427137Z         if scale_ub is not None:
2025-05-07T20:32:55.9427417Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:55.9427755Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:55.9428076Z             )
2025-05-07T20:32:55.9428273Z         else:
2025-05-07T20:32:55.9428486Z             scale_ub_tensor = None
2025-05-07T20:32:55.9428986Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:55.9429299Z             op = silu_mul_quant
2025-05-07T20:32:55.9429624Z             if compiled:
2025-05-07T20:32:55.9429877Z                 op = torch.compile(op)
2025-05-07T20:32:55.9430179Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:55.9430717Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:55.9430989Z moe/activation_test.py:117:
2025-05-07T20:32:55.9431288Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:55.9431624Z moe/activation_test.py:115: in fn
2025-05-07T20:32:55.9431908Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:55.9432610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:55.9433314Z     _fbgemm_silu_mul_quant[grid](
[traceback identical to the one above, ending in the same error:]
2025-05-07T20:32:55.9445896Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:55.9446778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:55.9447406Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[test body and traceback identical to the example above; same CompilationError from _fbgemm_silu_mul_quant]
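Every example in this run fails for the same underlying reason: Triton refuses to emit the fp8e4nv (float8 e4m3) element type for this GPU. The job runs on a g5.4xlarge runner, i.e. an NVIDIA A10G with compute capability sm_86, and Triton only compiles fp8e4nv for sm_89 (Ada) and newer; on this card it offers only 'fp8e4b15' and 'fp8e5', exactly as the ValueError reports. A minimal sketch of a guard a test suite could use, with a hypothetical helper name (this is not FBGEMM's actual skip logic):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton compiles fp8e4nv (e4m3) only for compute capability >= (8, 9),
        # i.e. Ada/Hopper; the A10G on this runner reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage: skip the whole fp8 test class on older GPUs.
    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
    class SiluMulQuantTests(unittest.TestCase):
        pass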
2025-05-07T20:32:55.9485990Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[test body and traceback identical to the first example above; same CompilationError from _fbgemm_silu_mul_quant]
2025-05-07T20:32:56.0895565Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:56.3457296Z self = <...>
2025-05-07T20:32:56.3457940Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
[test body identical to the first example above up to the call into fn(); here fn() returns and the reference path fails instead:]
2025-05-07T20:32:56.3470536Z         y_fp8, y_scale = fn()
2025-05-07T20:32:56.3470836Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:56.3471372Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:56.3471716Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:56.3472081Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:56.3472408Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:56.3472768Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:56.3473374Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:56.3473683Z moe/activation_test.py:126:
2025-05-07T20:32:56.3473982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:56.3474314Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:56.3474650Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:56.3475455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:56.3476217Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:56.3476844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:56.3477537Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:56.3478237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:56.3478969Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:56.3479728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:56.3480487Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:56.3481232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:56.3481879Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:56.3482497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:56.3483032Z     fn()
2025-05-07T20:32:56.3483593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:56.3484183Z     self.fn.run(
2025-05-07T20:32:56.3484660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:56.3485210Z     kernel = self.compile(
2025-05-07T20:32:56.3485756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:56.3486420Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:56.3486868Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[locals as in the first traceback above, except num_stages=2 in CUDAOptions]
2025-05-07T20:32:56.3492701Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:56.3493234Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:56.3493814Z                            module_map=module_map)
2025-05-07T20:32:56.3494190Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:56.3494550Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:56.3494828Z E   ^
2025-05-07T20:32:56.3495342Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:56.3496223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:56.3496847Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[test body and traceback identical to the first example above; same CompilationError from _fbgemm_silu_mul_quant]
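The scale_ub=None example above is the one variant that fails differently: the op under test returned, and the error moved into the test's own reference path, because triton_quantize_fp8_row launches _kernel_quantize_fp8_row, which also materializes fp8e4nv. So on this hardware even the reference computation cannot compile. A torch-only row-wise quantization along these lines could stand in as a reference; this is a sketch assuming torch.float8_e4m3fn is available (PyTorch 2.1+), not FBGEMM's triton_quantize_fp8_row:

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_torch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row symmetric quantization to float8 e4m3: choose a scale so the
        # row's max |value| (optionally clamped to scale_ub) maps to fp8 max.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
        scale = torch.clamp(row_max, min=1e-12) / fp8_max
        y_fp8 = (y.to(torch.float32) / scale).clamp(-fp8_max, fp8_max)
        return y_fp8.to(torch.float8_e4m3fn), scale.squeeze(-1)

Dequantizing with y_fp8.to(torch.float32) * scale[:, None], as the test does, recovers the input up to fp8 rounding.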
[each of the following examples runs the same test body and fails with the same CompilationError from _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:56.5213202Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:56.5250651Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:56.6281196Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:56.7603014Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
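Each "Trying example" line is Hypothesis walking the sampled_from grid under verbose settings; since every draw hits the same compile error, the search produces the run of repeats above. To replay one failing case deterministically instead of rerunning the whole search, the printed parameters can be pinned with hypothesis.example; a sketch (test body elided, self dropped for brevity):

    from hypothesis import example, given, settings, strategies as st

    @settings(deadline=None)
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @example(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled) -> None:
        ...  # same body as quoted above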
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:56.7622022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:56.7622710Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:56.7623377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:56.7623956Z kernel = self.compile( 2025-05-07T20:32:56.7624513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:56.7625179Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:56.7625579Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:56.7625804Z 2025-05-07T20:32:56.7626015Z self = 2025-05-07T20:32:56.7627106Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:56.7628527Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd5058157e0>} 2025-05-07T20:32:56.7629885Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:56.7630924Z context = 2025-05-07T20:32:56.7631258Z 2025-05-07T20:32:56.7631432Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:56.7631955Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:56.7632471Z module_map=module_map) 2025-05-07T20:32:56.7632842Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:56.7633196Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:56.7633457Z E ^ 2025-05-07T20:32:56.7633977Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:56.7634430Z 2025-05-07T20:32:56.7634867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:56.7635384Z 2025-05-07T20:32:56.7635537Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:56.7635958Z self=, 2025-05-07T20:32:56.7636361Z T=4096, 2025-05-07T20:32:56.7636550Z D=7168, 2025-05-07T20:32:56.7636744Z scale_ub=1200.0, 2025-05-07T20:32:56.7636975Z contiguous=False, 2025-05-07T20:32:56.7637204Z compiled=False, 2025-05-07T20:32:56.7637414Z ) 2025-05-07T20:32:56.7637738Z self = 2025-05-07T20:32:56.7638242Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:56.7638520Z 2025-05-07T20:32:56.7638599Z @given( 2025-05-07T20:32:56.7638833Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:56.7639147Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:56.7639459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:56.7639789Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:56.7640122Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:56.7640404Z ) 2025-05-07T20:32:56.7640760Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:56.7641206Z def test_silu_mul_quant( 2025-05-07T20:32:56.7641447Z self, 2025-05-07T20:32:56.7641638Z T: int, 2025-05-07T20:32:56.7641836Z D: int, 2025-05-07T20:32:56.7642055Z scale_ub: Optional[float], 2025-05-07T20:32:56.7642325Z contiguous: bool, 2025-05-07T20:32:56.7642567Z compiled: bool, 2025-05-07T20:32:56.7642791Z ) -> None: 2025-05-07T20:32:56.7643003Z torch.manual_seed(2025) 2025-05-07T20:32:56.7643244Z 2025-05-07T20:32:56.7643518Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:56.7643858Z 2025-05-07T20:32:56.7644122Z x_sign = torch.sign(x) 2025-05-07T20:32:56.7644440Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:56.7644745Z x = x_sign * x_clamp 2025-05-07T20:32:56.7644985Z x0 = x[:, :D] 2025-05-07T20:32:56.7645204Z x1 = x[:, D:] 2025-05-07T20:32:56.7645412Z 2025-05-07T20:32:56.7645599Z if contiguous: 2025-05-07T20:32:56.7645830Z x0 = x0.contiguous() 2025-05-07T20:32:56.7646082Z x1 = x1.contiguous() 2025-05-07T20:32:56.7646327Z 2025-05-07T20:32:56.7646521Z if scale_ub is not None: 2025-05-07T20:32:56.7646793Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:56.7647125Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:56.7647432Z ) 2025-05-07T20:32:56.7647629Z else: 2025-05-07T20:32:56.7647836Z scale_ub_tensor = None 2025-05-07T20:32:56.7648087Z 2025-05-07T20:32:56.7648325Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:56.7648633Z op = silu_mul_quant 2025-05-07T20:32:56.7648883Z if compiled: 2025-05-07T20:32:56.7649133Z op = torch.compile(op) 2025-05-07T20:32:56.7649433Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:56.7649760Z 2025-05-07T20:32:56.7649957Z > y_fp8, y_scale = fn() 2025-05-07T20:32:56.7650121Z 2025-05-07T20:32:56.7650218Z moe/activation_test.py:117: 2025-05-07T20:32:56.7650512Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:56.7650883Z moe/activation_test.py:115: in fn 2025-05-07T20:32:56.7651166Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:56.7651857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:56.7652559Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:56.7653106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:56.7653790Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:56.7654505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:56.7655043Z kernel = self.compile( 2025-05-07T20:32:56.7655768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:56.7656438Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:56.7656843Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:56.7657068Z 2025-05-07T20:32:56.7657281Z self = 2025-05-07T20:32:56.7658418Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:56.7659799Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd505816200>} 2025-05-07T20:32:56.7661162Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:56.7662197Z context = 2025-05-07T20:32:56.7662486Z 2025-05-07T20:32:56.7662659Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:56.7663179Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:56.7663650Z module_map=module_map) 2025-05-07T20:32:56.7664139Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:56.7664497Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:56.7664750Z E ^ 2025-05-07T20:32:56.7665222Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:56.7666733Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:56.7667143Z     self=<...>,
2025-05-07T20:32:56.7667542Z     T=16384,
2025-05-07T20:32:56.7667738Z     D=7168,
2025-05-07T20:32:56.7667926Z     scale_ub=None,
2025-05-07T20:32:56.7668140Z     contiguous=True,
2025-05-07T20:32:56.7668362Z     compiled=True,
2025-05-07T20:32:56.7668562Z )
2025-05-07T20:32:56.9613822Z self = <...>
2025-05-07T20:32:56.9614423Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:56.9614797Z
2025-05-07T20:32:56.9614938Z     @given(
2025-05-07T20:32:56.9615262Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:56.9615836Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:56.9616216Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:56.9616564Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:56.9616977Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:56.9617271Z     )
2025-05-07T20:32:56.9617638Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:56.9618158Z     def test_silu_mul_quant(
2025-05-07T20:32:56.9618409Z         self,
2025-05-07T20:32:56.9618612Z         T: int,
2025-05-07T20:32:56.9618810Z         D: int,
2025-05-07T20:32:56.9619041Z         scale_ub: Optional[float],
2025-05-07T20:32:56.9619329Z         contiguous: bool,
2025-05-07T20:32:56.9619573Z         compiled: bool,
2025-05-07T20:32:56.9619809Z     ) -> None:
2025-05-07T20:32:56.9620036Z         torch.manual_seed(2025)
2025-05-07T20:32:56.9620359Z
2025-05-07T20:32:56.9620651Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:56.9621037Z
2025-05-07T20:32:56.9621243Z         x_sign = torch.sign(x)
2025-05-07T20:32:56.9621546Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:56.9621863Z         x = x_sign * x_clamp
2025-05-07T20:32:56.9622115Z         x0 = x[:, :D]
2025-05-07T20:32:56.9622339Z         x1 = x[:, D:]
2025-05-07T20:32:56.9622551Z
2025-05-07T20:32:56.9622745Z         if contiguous:
2025-05-07T20:32:56.9622989Z             x0 = x0.contiguous()
2025-05-07T20:32:56.9623257Z             x1 = x1.contiguous()
2025-05-07T20:32:56.9623504Z
2025-05-07T20:32:56.9623708Z         if scale_ub is not None:
2025-05-07T20:32:56.9623996Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:56.9624394Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:56.9624720Z             )
2025-05-07T20:32:56.9624926Z         else:
2025-05-07T20:32:56.9625141Z             scale_ub_tensor = None
2025-05-07T20:32:56.9625405Z
2025-05-07T20:32:56.9625656Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:56.9625976Z             op = silu_mul_quant
2025-05-07T20:32:56.9626238Z             if compiled:
2025-05-07T20:32:56.9626494Z                 op = torch.compile(op)
2025-05-07T20:32:56.9626798Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:56.9627091Z
2025-05-07T20:32:56.9627292Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:56.9627463Z
2025-05-07T20:32:56.9627573Z moe/activation_test.py:117:
2025-05-07T20:32:56.9627873Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:56.9628213Z moe/activation_test.py:115: in fn
2025-05-07T20:32:56.9628573Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:56.9629154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:56.9629736Z     return fn(*args, **kwargs)
2025-05-07T20:32:56.9630425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:56.9631131Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:56.9631684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:56.9632388Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:56.9633071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:56.9633618Z     kernel = self.compile(
2025-05-07T20:32:56.9634179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:56.9634857Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:56.9635257Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:56.9635495Z
2025-05-07T20:32:56.9635757Z self = <...>
2025-05-07T20:32:56.9636869Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:56.9638314Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fd505817760>}
2025-05-07T20:32:56.9639684Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:56.9640730Z context = <...>
2025-05-07T20:32:56.9641067Z
2025-05-07T20:32:56.9641237Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:56.9641774Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:56.9642253Z                            module_map=module_map)
2025-05-07T20:32:56.9642627Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:56.9642991Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:56.9643256Z E       ^
2025-05-07T20:32:56.9643725Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:56.9644187Z
2025-05-07T20:32:56.9644612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:56.9645141Z
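The same failure can be reproduced without Hypothesis by calling the op directly with any one of the sampled parameter combinations. A minimal sketch, mirroring the test body above (the silu_mul_quant import path is taken from the traceback; a CUDA device is assumed, and the sizes are one of the sampled pairs):

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120  # one of the sampled (T, D) combinations
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()
    # On a GPU without fp8e4nv support this raises the CompilationError above.
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)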
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:56.9644187Z 2025-05-07T20:32:56.9644612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:56.9645141Z 2025-05-07T20:32:56.9645246Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:56.9645671Z self=, 2025-05-07T20:32:56.9646075Z T=4096, 2025-05-07T20:32:56.9646268Z D=5120, 2025-05-07T20:32:56.9646468Z scale_ub=None, 2025-05-07T20:32:56.9646684Z contiguous=False, 2025-05-07T20:32:56.9646914Z compiled=True, 2025-05-07T20:32:56.9647120Z ) 2025-05-07T20:32:56.9647440Z self = 2025-05-07T20:32:56.9647946Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:56.9648220Z 2025-05-07T20:32:56.9648303Z @given( 2025-05-07T20:32:56.9648532Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:56.9648850Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:56.9649160Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:56.9649544Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:56.9649875Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:56.9650164Z ) 2025-05-07T20:32:56.9650526Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:56.9650973Z def test_silu_mul_quant( 2025-05-07T20:32:56.9651219Z self, 2025-05-07T20:32:56.9651417Z T: int, 2025-05-07T20:32:56.9651615Z D: int, 2025-05-07T20:32:56.9651841Z scale_ub: Optional[float], 2025-05-07T20:32:56.9652122Z contiguous: bool, 2025-05-07T20:32:56.9652360Z compiled: bool, 2025-05-07T20:32:56.9652585Z ) -> None: 2025-05-07T20:32:56.9652809Z torch.manual_seed(2025) 2025-05-07T20:32:56.9653052Z 2025-05-07T20:32:56.9653330Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:56.9653677Z 2025-05-07T20:32:56.9653867Z x_sign = torch.sign(x) 2025-05-07T20:32:56.9654166Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:56.9654478Z x = x_sign * x_clamp 2025-05-07T20:32:56.9654722Z x0 = x[:, :D] 2025-05-07T20:32:56.9654935Z x1 = x[:, D:] 2025-05-07T20:32:56.9655150Z 2025-05-07T20:32:56.9655338Z if contiguous: 2025-05-07T20:32:56.9656159Z x0 = x0.contiguous() 2025-05-07T20:32:56.9656492Z x1 = x1.contiguous() 2025-05-07T20:32:56.9656729Z 2025-05-07T20:32:56.9656920Z if scale_ub is not None: 2025-05-07T20:32:56.9657201Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:56.9657616Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:56.9657920Z ) 2025-05-07T20:32:56.9658190Z else: 2025-05-07T20:32:56.9658401Z scale_ub_tensor = None 2025-05-07T20:32:56.9658650Z 2025-05-07T20:32:56.9658887Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:56.9659200Z op = silu_mul_quant 2025-05-07T20:32:56.9659449Z if compiled: 2025-05-07T20:32:56.9659704Z op = torch.compile(op) 2025-05-07T20:32:56.9660003Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:56.9660353Z 2025-05-07T20:32:56.9660542Z > y_fp8, y_scale = fn() 2025-05-07T20:32:56.9660715Z 2025-05-07T20:32:56.9660817Z moe/activation_test.py:117: 2025-05-07T20:32:56.9661113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:56.9661439Z moe/activation_test.py:115: in fn 2025-05-07T20:32:56.9661725Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:56.9662296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:56.9662857Z return fn(*args, **kwargs) 
2025-05-07T20:32:56.9663529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:56.9664232Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:56.9664776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:56.9665463Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:56.9666142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:56.9666679Z kernel = self.compile( 2025-05-07T20:32:56.9667227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:56.9667894Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:56.9668291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:56.9668517Z 2025-05-07T20:32:56.9668732Z self = 2025-05-07T20:32:56.9669898Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:56.9671314Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a4280>} 2025-05-07T20:32:56.9672681Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:56.9673737Z context = 2025-05-07T20:32:56.9674069Z 2025-05-07T20:32:56.9674241Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:56.9674764Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:56.9675242Z module_map=module_map) 2025-05-07T20:32:56.9675610Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:56.9675965Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:56.9676221Z E ^ 2025-05-07T20:32:56.9676737Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:56.9677196Z 2025-05-07T20:32:56.9677622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:56.9678180Z 2025-05-07T20:32:57.2924903Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.2925824Z self=, 2025-05-07T20:32:57.2926643Z T=4096, 2025-05-07T20:32:57.2927018Z D=5120, 2025-05-07T20:32:57.2927410Z scale_ub=1200.0, 2025-05-07T20:32:57.2927857Z contiguous=False, 2025-05-07T20:32:57.2928306Z compiled=False, 2025-05-07T20:32:57.2928716Z ) 2025-05-07T20:32:57.2929370Z self = 2025-05-07T20:32:57.2930376Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:57.2931115Z 2025-05-07T20:32:57.2931280Z @given( 2025-05-07T20:32:57.2931740Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.2932373Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.2932995Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.2933570Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.2933908Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.2934211Z ) 2025-05-07T20:32:57.2934566Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.2935016Z def test_silu_mul_quant( 2025-05-07T20:32:57.2935266Z self, 2025-05-07T20:32:57.2935463Z T: int, 2025-05-07T20:32:57.2935661Z D: int, 2025-05-07T20:32:57.2935889Z scale_ub: Optional[float], 2025-05-07T20:32:57.2936162Z contiguous: bool, 2025-05-07T20:32:57.2936407Z compiled: bool, 2025-05-07T20:32:57.2936642Z ) -> None: 2025-05-07T20:32:57.2936861Z torch.manual_seed(2025) 2025-05-07T20:32:57.2937109Z 2025-05-07T20:32:57.2937391Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.2937735Z 2025-05-07T20:32:57.2937939Z x_sign = torch.sign(x) 2025-05-07T20:32:57.2938334Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.2938652Z x = x_sign * x_clamp 2025-05-07T20:32:57.2938892Z x0 = x[:, :D] 2025-05-07T20:32:57.2939113Z x1 = x[:, D:] 2025-05-07T20:32:57.2939326Z 2025-05-07T20:32:57.2939513Z if contiguous: 2025-05-07T20:32:57.2939747Z x0 = x0.contiguous() 2025-05-07T20:32:57.2940010Z x1 = x1.contiguous() 2025-05-07T20:32:57.2940250Z 2025-05-07T20:32:57.2940448Z if scale_ub is not None: 2025-05-07T20:32:57.2940827Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.2941165Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.2941484Z ) 2025-05-07T20:32:57.2941682Z else: 2025-05-07T20:32:57.2941895Z scale_ub_tensor = None 2025-05-07T20:32:57.2942156Z 2025-05-07T20:32:57.2942400Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.2942715Z op = silu_mul_quant 2025-05-07T20:32:57.2942966Z if compiled: 2025-05-07T20:32:57.2943221Z op = torch.compile(op) 2025-05-07T20:32:57.2943527Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.2943804Z 2025-05-07T20:32:57.2944029Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.2944221Z 2025-05-07T20:32:57.2944329Z moe/activation_test.py:117: 2025-05-07T20:32:57.2944626Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.2944959Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.2945246Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.2945946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:57.2946715Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.2947262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.2947954Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.2948684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.2949224Z kernel = self.compile( 2025-05-07T20:32:57.2949779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.2950442Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.2950842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.2951074Z 2025-05-07T20:32:57.2951290Z self = 2025-05-07T20:32:57.2952435Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.2953836Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a5000>} 2025-05-07T20:32:57.2955196Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.2956484Z context = 2025-05-07T20:32:57.2956782Z 2025-05-07T20:32:57.2956951Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.2957486Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.2957962Z module_map=module_map) 2025-05-07T20:32:57.2958336Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.2958698Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.2958967Z E ^ 2025-05-07T20:32:57.2959437Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.2959897Z 2025-05-07T20:32:57.2960319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.2960837Z 2025-05-07T20:32:57.2960950Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.2961465Z self=, 2025-05-07T20:32:57.2961872Z T=4096, 2025-05-07T20:32:57.2962067Z D=5120, 2025-05-07T20:32:57.2962263Z scale_ub=1200.0, 2025-05-07T20:32:57.2962504Z contiguous=False, 2025-05-07T20:32:57.2962734Z compiled=True, 2025-05-07T20:32:57.2962948Z ) 2025-05-07T20:32:57.2963307Z self = 2025-05-07T20:32:57.2963807Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:57.2964089Z 2025-05-07T20:32:57.2964170Z @given( 2025-05-07T20:32:57.2964407Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.2964726Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.2965042Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.2965378Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.2965709Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.2966007Z ) 2025-05-07T20:32:57.2966365Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.2966807Z def test_silu_mul_quant( 2025-05-07T20:32:57.2967062Z self, 2025-05-07T20:32:57.2967262Z T: int, 2025-05-07T20:32:57.2967526Z D: int, 2025-05-07T20:32:57.2967761Z scale_ub: Optional[float], 2025-05-07T20:32:57.2968037Z contiguous: bool, 2025-05-07T20:32:57.2968284Z compiled: bool, 2025-05-07T20:32:57.2968511Z ) -> None: 2025-05-07T20:32:57.2968794Z torch.manual_seed(2025) 2025-05-07T20:32:57.2969042Z 2025-05-07T20:32:57.2969321Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.2969674Z 2025-05-07T20:32:57.2969881Z x_sign = torch.sign(x) 2025-05-07T20:32:57.2970175Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.2970492Z x = x_sign * x_clamp 2025-05-07T20:32:57.2970737Z x0 = x[:, :D] 2025-05-07T20:32:57.2970957Z x1 = x[:, D:] 2025-05-07T20:32:57.2971176Z 2025-05-07T20:32:57.2971370Z if contiguous: 2025-05-07T20:32:57.2971604Z x0 = x0.contiguous() 2025-05-07T20:32:57.2971937Z x1 = x1.contiguous() 2025-05-07T20:32:57.2972184Z 2025-05-07T20:32:57.2972387Z if scale_ub is not None: 2025-05-07T20:32:57.2972666Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.2973014Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.2973333Z ) 2025-05-07T20:32:57.2973527Z else: 2025-05-07T20:32:57.2973746Z scale_ub_tensor = None 2025-05-07T20:32:57.2974001Z 2025-05-07T20:32:57.2974235Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.2974552Z op = silu_mul_quant 2025-05-07T20:32:57.2974817Z if compiled: 2025-05-07T20:32:57.2975065Z op = torch.compile(op) 2025-05-07T20:32:57.2975373Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.2975651Z 2025-05-07T20:32:57.2975852Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.2976025Z 2025-05-07T20:32:57.2976132Z moe/activation_test.py:117: 2025-05-07T20:32:57.2976430Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.2976756Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.2977043Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.2977609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:57.2978220Z return fn(*args, **kwargs) 
2025-05-07T20:32:57.2978883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:57.2979582Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.2980126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.2980916Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.2981591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.2982135Z kernel = self.compile( 2025-05-07T20:32:57.2982689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.2983349Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.2983751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.2983978Z 2025-05-07T20:32:57.2984193Z self = 2025-05-07T20:32:57.2985293Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.2986681Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a4700>} 2025-05-07T20:32:57.2988095Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.2989144Z context = 2025-05-07T20:32:57.2989476Z 2025-05-07T20:32:57.2989652Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.2990180Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.2990663Z module_map=module_map) 2025-05-07T20:32:57.2991036Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.2991401Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.2991662Z E ^ 2025-05-07T20:32:57.2992137Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.2992638Z 2025-05-07T20:32:57.2993069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.2993587Z 2025-05-07T20:32:57.4260696Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.4261201Z self=, 2025-05-07T20:32:57.4261650Z T=2048, 2025-05-07T20:32:57.4261869Z D=7168, 2025-05-07T20:32:57.4262076Z scale_ub=1200.0, 2025-05-07T20:32:57.4262307Z contiguous=False, 2025-05-07T20:32:57.4262540Z compiled=False, 2025-05-07T20:32:57.4262752Z ) 2025-05-07T20:32:57.4263075Z self = 2025-05-07T20:32:57.4263584Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:57.4263862Z 2025-05-07T20:32:57.4263949Z @given( 2025-05-07T20:32:57.4264191Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.4264508Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.4264825Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.4265161Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.4265490Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.4265781Z ) 2025-05-07T20:32:57.4266136Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.4266578Z def test_silu_mul_quant( 2025-05-07T20:32:57.4266823Z self, 2025-05-07T20:32:57.4267022Z T: int, 2025-05-07T20:32:57.4267218Z D: int, 2025-05-07T20:32:57.4267442Z scale_ub: Optional[float], 2025-05-07T20:32:57.4267724Z contiguous: bool, 2025-05-07T20:32:57.4268077Z compiled: bool, 2025-05-07T20:32:57.4268313Z ) -> None: 2025-05-07T20:32:57.4268536Z torch.manual_seed(2025) 2025-05-07T20:32:57.4268779Z 2025-05-07T20:32:57.4269055Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.4269403Z 2025-05-07T20:32:57.4269605Z x_sign = torch.sign(x) 2025-05-07T20:32:57.4269899Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.4270212Z x = x_sign * x_clamp 2025-05-07T20:32:57.4270454Z x0 = x[:, :D] 2025-05-07T20:32:57.4270671Z x1 = x[:, D:] 2025-05-07T20:32:57.4270880Z 2025-05-07T20:32:57.4271071Z if contiguous: 2025-05-07T20:32:57.4271304Z x0 = x0.contiguous() 2025-05-07T20:32:57.4271568Z x1 = x1.contiguous() 2025-05-07T20:32:57.4271813Z 2025-05-07T20:32:57.4272004Z if scale_ub is not None: 2025-05-07T20:32:57.4272283Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.4272631Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.4272939Z ) 2025-05-07T20:32:57.4273140Z else: 2025-05-07T20:32:57.4273359Z scale_ub_tensor = None 2025-05-07T20:32:57.4280081Z 2025-05-07T20:32:57.4280458Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.4280793Z op = silu_mul_quant 2025-05-07T20:32:57.4281045Z if compiled: 2025-05-07T20:32:57.4281289Z op = torch.compile(op) 2025-05-07T20:32:57.4281587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.4281925Z 2025-05-07T20:32:57.4282113Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.4282284Z 2025-05-07T20:32:57.4282383Z moe/activation_test.py:117: 2025-05-07T20:32:57.4282674Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.4283007Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.4283284Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.4283982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:57.4284749Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.4285293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.4285983Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.4286648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.4287191Z kernel = self.compile( 2025-05-07T20:32:57.4287732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.4288398Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.4288792Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.4289041Z 2025-05-07T20:32:57.4289259Z self = 2025-05-07T20:32:57.4290359Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.4291751Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a5240>} 2025-05-07T20:32:57.4293116Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.4294207Z context = 2025-05-07T20:32:57.4294496Z 2025-05-07T20:32:57.4294715Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.4295240Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.4295720Z module_map=module_map) 2025-05-07T20:32:57.4296106Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.4296464Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.4296719Z E ^ 2025-05-07T20:32:57.4297190Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.4297649Z 2025-05-07T20:32:57.4298154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.4298672Z 2025-05-07T20:32:57.4298789Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.4299206Z self=, 2025-05-07T20:32:57.4299609Z T=1, 2025-05-07T20:32:57.4299802Z D=7168, 2025-05-07T20:32:57.4299993Z scale_ub=None, 2025-05-07T20:32:57.4300213Z contiguous=True, 2025-05-07T20:32:57.4300442Z compiled=False, 2025-05-07T20:32:57.4300646Z ) 2025-05-07T20:32:57.4300971Z self = 2025-05-07T20:32:57.4301513Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:57.4301776Z 2025-05-07T20:32:57.4301853Z @given( 2025-05-07T20:32:57.4302087Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.4302445Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.4302755Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.4303085Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.4303418Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.4303706Z ) 2025-05-07T20:32:57.4304057Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.4304556Z def test_silu_mul_quant( 2025-05-07T20:32:57.4304796Z self, 2025-05-07T20:32:57.4304988Z T: int, 2025-05-07T20:32:57.4305232Z D: int, 2025-05-07T20:32:57.4305456Z scale_ub: Optional[float], 2025-05-07T20:32:57.4305726Z contiguous: bool, 2025-05-07T20:32:57.4305969Z compiled: bool, 2025-05-07T20:32:57.4306191Z ) -> None: 2025-05-07T20:32:57.4306411Z torch.manual_seed(2025) 2025-05-07T20:32:57.4306652Z 2025-05-07T20:32:57.4306923Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.4307268Z 2025-05-07T20:32:57.4307461Z x_sign = torch.sign(x) 2025-05-07T20:32:57.4307755Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.4308059Z x = x_sign * x_clamp 2025-05-07T20:32:57.4308301Z x0 = x[:, :D] 2025-05-07T20:32:57.4308515Z x1 = x[:, D:] 2025-05-07T20:32:57.4308717Z 2025-05-07T20:32:57.4308903Z if contiguous: 2025-05-07T20:32:57.4309137Z x0 = x0.contiguous() 2025-05-07T20:32:57.4309391Z x1 = x1.contiguous() 2025-05-07T20:32:57.4309634Z 2025-05-07T20:32:57.4309831Z if scale_ub is not None: 2025-05-07T20:32:57.4310102Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.4310444Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.4310751Z ) 2025-05-07T20:32:57.4310943Z else: 2025-05-07T20:32:57.4311154Z scale_ub_tensor = None 2025-05-07T20:32:57.4311413Z 2025-05-07T20:32:57.4311644Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.4311962Z op = silu_mul_quant 2025-05-07T20:32:57.4312211Z if compiled: 2025-05-07T20:32:57.4312461Z op = torch.compile(op) 2025-05-07T20:32:57.4312756Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.4313031Z 2025-05-07T20:32:57.4313224Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.4313438Z 2025-05-07T20:32:57.4313541Z moe/activation_test.py:117: 2025-05-07T20:32:57.4313836Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.4314164Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.4314441Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.4315140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:57.4315839Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.4316381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.4317063Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.4317737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.4318275Z kernel = self.compile( 2025-05-07T20:32:57.4318819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.4319480Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.4319924Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.4320151Z 2025-05-07T20:32:57.4320363Z self = 2025-05-07T20:32:57.4321447Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.4322875Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a6050>} 2025-05-07T20:32:57.4324296Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.4325376Z context = 2025-05-07T20:32:57.4325665Z 2025-05-07T20:32:57.4325837Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.4326358Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.4326834Z module_map=module_map) 2025-05-07T20:32:57.4327202Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.4327556Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.4327815Z E ^ 2025-05-07T20:32:57.4328280Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.4328732Z 2025-05-07T20:32:57.4329160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.4329676Z 2025-05-07T20:32:57.4329780Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.4330199Z self=, 2025-05-07T20:32:57.4330604Z T=16384, 2025-05-07T20:32:57.4330793Z D=7168, 2025-05-07T20:32:57.4330989Z scale_ub=1200.0, 2025-05-07T20:32:57.4331215Z contiguous=False, 2025-05-07T20:32:57.4331436Z compiled=True, 2025-05-07T20:32:57.6949127Z ) 2025-05-07T20:32:57.6950309Z self = 2025-05-07T20:32:57.6951707Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:57.6952299Z 2025-05-07T20:32:57.6952463Z @given( 2025-05-07T20:32:57.6952944Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.6953496Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.6953959Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.6954307Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.6954646Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.6954947Z ) 2025-05-07T20:32:57.6955321Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.6956014Z def test_silu_mul_quant( 2025-05-07T20:32:57.6956268Z self, 2025-05-07T20:32:57.6956475Z T: int, 2025-05-07T20:32:57.6956675Z D: int, 2025-05-07T20:32:57.6956909Z scale_ub: Optional[float], 2025-05-07T20:32:57.6957191Z contiguous: bool, 2025-05-07T20:32:57.6957439Z compiled: bool, 2025-05-07T20:32:57.6957668Z ) -> None: 2025-05-07T20:32:57.6957893Z torch.manual_seed(2025) 2025-05-07T20:32:57.6958142Z 2025-05-07T20:32:57.6958423Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.6958777Z 2025-05-07T20:32:57.6958985Z x_sign = torch.sign(x) 2025-05-07T20:32:57.6959281Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.6959601Z x = x_sign * x_clamp 2025-05-07T20:32:57.6959854Z x0 = x[:, :D] 2025-05-07T20:32:57.6960074Z x1 = x[:, D:] 2025-05-07T20:32:57.6960290Z 2025-05-07T20:32:57.6960561Z if contiguous: 2025-05-07T20:32:57.6960804Z x0 = x0.contiguous() 2025-05-07T20:32:57.6961072Z x1 = x1.contiguous() 2025-05-07T20:32:57.6961321Z 2025-05-07T20:32:57.6961580Z if scale_ub is not None: 2025-05-07T20:32:57.6961864Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.6962211Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.6962521Z ) 2025-05-07T20:32:57.6962724Z else: 2025-05-07T20:32:57.6962943Z scale_ub_tensor = None 2025-05-07T20:32:57.6963204Z 2025-05-07T20:32:57.6963443Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.6963768Z op = silu_mul_quant 2025-05-07T20:32:57.6964025Z if compiled: 2025-05-07T20:32:57.6964277Z op = torch.compile(op) 2025-05-07T20:32:57.6964654Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.6964933Z 2025-05-07T20:32:57.6965134Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.6965310Z 2025-05-07T20:32:57.6965416Z moe/activation_test.py:117: 2025-05-07T20:32:57.6965713Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.6966052Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.6966349Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.6966926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:57.6967504Z return fn(*args, **kwargs) 
2025-05-07T20:32:57.6968173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:57.6968883Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.6969434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.6970294Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.6970961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.6971501Z kernel = self.compile( 2025-05-07T20:32:57.6972054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.6972713Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.6973109Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.6973334Z 2025-05-07T20:32:57.6973549Z self = 2025-05-07T20:32:57.6974721Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.6976135Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a7490>} 2025-05-07T20:32:57.6977508Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.6978616Z context = 2025-05-07T20:32:57.6978911Z 2025-05-07T20:32:57.6979081Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.6979612Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.6980082Z module_map=module_map) 2025-05-07T20:32:57.6980448Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.6980807Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.6981066Z E ^ 2025-05-07T20:32:57.6981602Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.6982066Z 2025-05-07T20:32:57.6982487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.6983050Z 2025-05-07T20:32:57.6983160Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.6983573Z self=, 2025-05-07T20:32:57.6983976Z T=1, 2025-05-07T20:32:57.6984164Z D=7168, 2025-05-07T20:32:57.6984362Z scale_ub=None, 2025-05-07T20:32:57.6984575Z contiguous=False, 2025-05-07T20:32:57.6984810Z compiled=False, 2025-05-07T20:32:57.6985018Z ) 2025-05-07T20:32:57.6985336Z self = 2025-05-07T20:32:57.6985868Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:57.6986132Z 2025-05-07T20:32:57.6986216Z @given( 2025-05-07T20:32:57.6986442Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.6986761Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.6987069Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.6987398Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.6987729Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.6988016Z ) 2025-05-07T20:32:57.6988373Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.6988811Z def test_silu_mul_quant( 2025-05-07T20:32:57.6989057Z self, 2025-05-07T20:32:57.6989252Z T: int, 2025-05-07T20:32:57.6989446Z D: int, 2025-05-07T20:32:57.6989664Z scale_ub: Optional[float], 2025-05-07T20:32:57.6989936Z contiguous: bool, 2025-05-07T20:32:57.6990175Z compiled: bool, 2025-05-07T20:32:57.6990399Z ) -> None: 2025-05-07T20:32:57.6990621Z torch.manual_seed(2025) 2025-05-07T20:32:57.6990860Z 2025-05-07T20:32:57.6991137Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.6991475Z 2025-05-07T20:32:57.6991663Z x_sign = torch.sign(x) 2025-05-07T20:32:57.6991960Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.6992266Z x = x_sign * x_clamp 2025-05-07T20:32:57.6992501Z x0 = x[:, :D] 2025-05-07T20:32:57.6992716Z x1 = x[:, D:] 2025-05-07T20:32:57.6992922Z 2025-05-07T20:32:57.6993111Z if contiguous: 2025-05-07T20:32:57.6993339Z x0 = x0.contiguous() 2025-05-07T20:32:57.6993598Z x1 = x1.contiguous() 2025-05-07T20:32:57.6993884Z 2025-05-07T20:32:57.6994075Z if scale_ub is not None: 2025-05-07T20:32:57.6994353Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.6994695Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.6994999Z ) 2025-05-07T20:32:57.6995198Z else: 2025-05-07T20:32:57.6995411Z scale_ub_tensor = None 2025-05-07T20:32:57.6995660Z 2025-05-07T20:32:57.6995896Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.6996211Z op = silu_mul_quant 2025-05-07T20:32:57.6996459Z if compiled: 2025-05-07T20:32:57.6996711Z op = torch.compile(op) 2025-05-07T20:32:57.6997009Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.6997276Z 2025-05-07T20:32:57.6997473Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.6997643Z 2025-05-07T20:32:57.6997744Z moe/activation_test.py:117: 2025-05-07T20:32:57.6998047Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.6998368Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.6998650Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.6999395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:57.7000091Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.7000632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.7001366Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.7002032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.7002563Z kernel = self.compile( 2025-05-07T20:32:57.7003110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.7003776Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.7004167Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.7004441Z 2025-05-07T20:32:57.7004655Z self = 2025-05-07T20:32:57.7005748Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.7007143Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a77f0>} 2025-05-07T20:32:57.7008507Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.7009544Z context = 2025-05-07T20:32:57.7009839Z 2025-05-07T20:32:57.7010010Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.7010544Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.7011017Z module_map=module_map) 2025-05-07T20:32:57.7011378Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.7011737Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.7011998Z E ^ 2025-05-07T20:32:57.7012463Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.7012922Z 2025-05-07T20:32:57.7013341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.7013864Z 2025-05-07T20:32:57.7014018Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.7014444Z self=, 2025-05-07T20:32:57.7014848Z T=2048, 2025-05-07T20:32:57.7015041Z D=7168, 2025-05-07T20:32:57.7015235Z scale_ub=None, 2025-05-07T20:32:57.7015452Z contiguous=False, 2025-05-07T20:32:57.7015681Z compiled=True, 2025-05-07T20:32:57.7015885Z ) 2025-05-07T20:32:57.8006291Z self = 2025-05-07T20:32:57.8007831Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:57.8008614Z 2025-05-07T20:32:57.8008826Z @given( 2025-05-07T20:32:57.8009453Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.8010087Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.8010717Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.8011392Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.8012065Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.8012639Z ) 2025-05-07T20:32:57.8013347Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.8014197Z def test_silu_mul_quant( 2025-05-07T20:32:57.8014480Z self, 2025-05-07T20:32:57.8014800Z T: int, 2025-05-07T20:32:57.8015008Z D: int, 2025-05-07T20:32:57.8015231Z scale_ub: Optional[float], 2025-05-07T20:32:57.8015515Z contiguous: bool, 2025-05-07T20:32:57.8015850Z compiled: bool, 2025-05-07T20:32:57.8016077Z ) -> None: 2025-05-07T20:32:57.8016302Z torch.manual_seed(2025) 2025-05-07T20:32:57.8016551Z 2025-05-07T20:32:57.8016834Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.8017185Z 2025-05-07T20:32:57.8017390Z x_sign = torch.sign(x) 2025-05-07T20:32:57.8017685Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.8018007Z x = x_sign * x_clamp 2025-05-07T20:32:57.8018324Z x0 = x[:, :D] 2025-05-07T20:32:57.8018547Z x1 = x[:, D:] 2025-05-07T20:32:57.8018766Z 2025-05-07T20:32:57.8019034Z if contiguous: 2025-05-07T20:32:57.8019269Z x0 = x0.contiguous() 2025-05-07T20:32:57.8019540Z x1 = x1.contiguous() 2025-05-07T20:32:57.8019790Z 2025-05-07T20:32:57.8019986Z if scale_ub is not None: 2025-05-07T20:32:57.8020270Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.8020620Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.8020934Z ) 2025-05-07T20:32:57.8021131Z else: 2025-05-07T20:32:57.8021350Z scale_ub_tensor = None 2025-05-07T20:32:57.8021607Z 2025-05-07T20:32:57.8021843Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.8022169Z op = silu_mul_quant 2025-05-07T20:32:57.8022425Z if compiled: 2025-05-07T20:32:57.8022679Z op = torch.compile(op) 2025-05-07T20:32:57.8022985Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.8023267Z 2025-05-07T20:32:57.8023464Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.8023640Z 2025-05-07T20:32:57.8023743Z moe/activation_test.py:117: 2025-05-07T20:32:57.8024044Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.8024415Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.8024716Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.8025291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:57.8025859Z return fn(*args, **kwargs) 
2025-05-07T20:32:57.8026525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:57.8027226Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.8027896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.8028591Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.8029266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.8029806Z kernel = self.compile( 2025-05-07T20:32:57.8030359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.8031026Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.8031433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.8031667Z 2025-05-07T20:32:57.8031877Z self = 2025-05-07T20:32:57.8032982Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.8034427Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd505540af0>} 2025-05-07T20:32:57.8035788Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.8036874Z context = 2025-05-07T20:32:57.8037165Z 2025-05-07T20:32:57.8037343Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.8037880Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.8038358Z module_map=module_map) 2025-05-07T20:32:57.8038734Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.8039095Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.8039353Z E ^ 2025-05-07T20:32:57.8039869Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.8040326Z 2025-05-07T20:32:57.8040753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.8041273Z 2025-05-07T20:32:57.8041389Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.8041809Z self=, 2025-05-07T20:32:57.8042213Z T=4096, 2025-05-07T20:32:57.8042406Z D=7168, 2025-05-07T20:32:57.8042606Z scale_ub=None, 2025-05-07T20:32:57.8048718Z contiguous=False, 2025-05-07T20:32:57.8048964Z compiled=True, 2025-05-07T20:32:57.8049164Z ) 2025-05-07T20:32:57.8049495Z self = 2025-05-07T20:32:57.8049997Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:57.8050270Z 2025-05-07T20:32:57.8050348Z @given( 2025-05-07T20:32:57.8050577Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.8050893Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.8051196Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.8051524Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.8051856Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.8052133Z ) 2025-05-07T20:32:57.8052482Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.8052931Z def test_silu_mul_quant( 2025-05-07T20:32:57.8053175Z self, 2025-05-07T20:32:57.8053360Z T: int, 2025-05-07T20:32:57.8053553Z D: int, 2025-05-07T20:32:57.8053772Z scale_ub: Optional[float], 2025-05-07T20:32:57.8054113Z contiguous: bool, 2025-05-07T20:32:57.8054380Z compiled: bool, 2025-05-07T20:32:57.8054630Z ) -> None: 2025-05-07T20:32:57.8054838Z torch.manual_seed(2025) 2025-05-07T20:32:57.8055082Z 2025-05-07T20:32:57.8055362Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.8055952Z 2025-05-07T20:32:57.8056147Z x_sign = torch.sign(x) 2025-05-07T20:32:57.8056439Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.8056753Z x = x_sign * x_clamp 2025-05-07T20:32:57.8056987Z x0 = x[:, :D] 2025-05-07T20:32:57.8057204Z x1 = x[:, D:] 2025-05-07T20:32:57.8057410Z 2025-05-07T20:32:57.8057589Z if contiguous: 2025-05-07T20:32:57.8057821Z x0 = x0.contiguous() 2025-05-07T20:32:57.8058181Z x1 = x1.contiguous() 2025-05-07T20:32:57.8058419Z 2025-05-07T20:32:57.8058612Z if scale_ub is not None: 2025-05-07T20:32:57.8058889Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.8059227Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.8059532Z ) 2025-05-07T20:32:57.8059728Z else: 2025-05-07T20:32:57.8059932Z scale_ub_tensor = None 2025-05-07T20:32:57.8060181Z 2025-05-07T20:32:57.8060503Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.8060816Z op = silu_mul_quant 2025-05-07T20:32:57.8061069Z if compiled: 2025-05-07T20:32:57.8061317Z op = torch.compile(op) 2025-05-07T20:32:57.8061672Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.8061942Z 2025-05-07T20:32:57.8062135Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.8062300Z 2025-05-07T20:32:57.8062405Z moe/activation_test.py:117: 2025-05-07T20:32:57.8062691Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.8063019Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.8063307Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.8063868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:57.8064501Z return fn(*args, **kwargs) 
2025-05-07T20:32:57.8065167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:57.8065861Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.8066397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.8067082Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.8067744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.8068278Z kernel = self.compile( 2025-05-07T20:32:57.8068829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.8069491Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.8069886Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.8070111Z 2025-05-07T20:32:57.8070322Z self = 2025-05-07T20:32:57.8071412Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.8072805Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd505540280>} 2025-05-07T20:32:57.8074279Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.8075320Z context = 2025-05-07T20:32:57.8075618Z 2025-05-07T20:32:57.8075785Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.8076317Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.8076786Z module_map=module_map) 2025-05-07T20:32:57.8077153Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.8077523Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.8077783Z E ^ 2025-05-07T20:32:57.8078249Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.8078708Z 2025-05-07T20:32:57.8079130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.8079642Z 2025-05-07T20:32:58.1471471Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.1472117Z self=, 2025-05-07T20:32:58.1472763Z T=16384, 2025-05-07T20:32:58.1473036Z D=5120, 2025-05-07T20:32:58.1473452Z scale_ub=1200.0, 2025-05-07T20:32:58.1473745Z contiguous=False, 2025-05-07T20:32:58.1473975Z compiled=False, 2025-05-07T20:32:58.1474178Z ) 2025-05-07T20:32:58.1474542Z self = 2025-05-07T20:32:58.1475136Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:58.1475423Z 2025-05-07T20:32:58.1475505Z @given( 2025-05-07T20:32:58.1475738Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.1476057Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.1476364Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.1476703Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.1477035Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.1477321Z ) 2025-05-07T20:32:58.1477740Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.1478190Z def test_silu_mul_quant( 2025-05-07T20:32:58.1478433Z self, 2025-05-07T20:32:58.1478623Z T: int, 2025-05-07T20:32:58.1478821Z D: int, 2025-05-07T20:32:58.1479039Z scale_ub: Optional[float], 2025-05-07T20:32:58.1479314Z contiguous: bool, 2025-05-07T20:32:58.1479555Z compiled: bool, 2025-05-07T20:32:58.1479788Z ) -> None: 2025-05-07T20:32:58.1480003Z torch.manual_seed(2025) 2025-05-07T20:32:58.1480249Z 2025-05-07T20:32:58.1480525Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.1480867Z 2025-05-07T20:32:58.1481063Z x_sign = torch.sign(x) 2025-05-07T20:32:58.1481364Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.1481681Z x = x_sign * x_clamp 2025-05-07T20:32:58.1481914Z x0 = x[:, :D] 2025-05-07T20:32:58.1482135Z x1 = x[:, D:] 2025-05-07T20:32:58.1482343Z 2025-05-07T20:32:58.1482529Z if contiguous: 2025-05-07T20:32:58.1482764Z x0 = x0.contiguous() 2025-05-07T20:32:58.1483028Z x1 = x1.contiguous() 2025-05-07T20:32:58.1483267Z 2025-05-07T20:32:58.1483461Z if scale_ub is not None: 2025-05-07T20:32:58.1483741Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.1484076Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.1484390Z ) 2025-05-07T20:32:58.1484588Z else: 2025-05-07T20:32:58.1484791Z scale_ub_tensor = None 2025-05-07T20:32:58.1485046Z 2025-05-07T20:32:58.1485284Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.1485593Z op = silu_mul_quant 2025-05-07T20:32:58.1485915Z if compiled: 2025-05-07T20:32:58.1486170Z op = torch.compile(op) 2025-05-07T20:32:58.1486464Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.1486742Z 2025-05-07T20:32:58.1486940Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.1487105Z 2025-05-07T20:32:58.1487214Z moe/activation_test.py:117: 2025-05-07T20:32:58.1487508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.1487841Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.1488132Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.1488832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:58.1489527Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:58.1490071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:58.1490761Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:58.1491424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:58.1491966Z     kernel = self.compile(
2025-05-07T20:32:58.1492562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:58.1493224Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:58.1493621Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:58.1493894Z 
2025-05-07T20:32:58.1494124Z self = 
2025-05-07T20:32:58.1495253Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:58.1496658Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd505542d40>}
2025-05-07T20:32:58.1498132Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:58.1499174Z context = 
2025-05-07T20:32:58.1499474Z 
2025-05-07T20:32:58.1499641Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:58.1500167Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:58.1500641Z                            module_map=module_map)
2025-05-07T20:32:58.1501010Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:58.1501370Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:58.1501630Z E   ^
2025-05-07T20:32:58.1502100Z E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:58.1502560Z 
2025-05-07T20:32:58.1502983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
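Every failure in this run is the same architecture gap: Triton's fp8e4nv (e4m3) type is only available on NVIDIA GPUs with compute capability 8.9 or newer (Ada and Hopper), while older parts expose only fp8e4b15 and fp8e5, which is exactly the ValueError repeated below. A minimal guard sketch along these lines (has_fp8e4nv and the test class are hypothetical helpers for illustration, not part of moe/activation_test.py) would skip the fp8 cases on unsupported hardware instead of erroring:

```python
# Hypothetical guard, not part of the FBGEMM test suite: skip fp8e4nv
# tests on GPUs that cannot compile them.
import unittest

import torch


def has_fp8e4nv() -> bool:
    # fp8e4nv (e4m3) needs compute capability (8, 9) or newer; earlier
    # GPUs only get fp8e4b15/fp8e5 in Triton.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(has_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
class SiluMulQuantFp8Tests(unittest.TestCase):
    def test_capability_assumption(self) -> None:
        # Only runs on hardware where the fp8e4nv kernels can compile.
        self.assertTrue(has_fp8e4nv())
```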
Hypothesis went on to try eleven more examples; every one failed with the identical CompilationError from triton/compiler/compiler.py:100 (ValueError: type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')):

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
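Reading the test body above, silu_mul_quant takes the two halves x0 and x1 of the input, applies SiLU to x0, multiplies by x1, and quantizes the product to fp8 with a per-row scale that scale_ub may cap. A rough eager-mode sketch of those assumed semantics (silu_mul_quant_ref is a hypothetical reference helper inferred from the test, not FBGEMM's _fbgemm_silu_mul_quant kernel, which fuses all of this into one Triton pass):

```python
# Rough eager-mode sketch of the semantics under test, assumed from the
# test body above; not FBGEMM's actual implementation.
from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # silu(x0) * x1, computed in fp32 for a stable reference.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # One scale per row, derived from the row's absolute maximum.
    row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.float())
    y_scale = row_max / FP8_MAX
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale
```

Note that the eager cast to torch.float8_e4m3fn is emulated by PyTorch and works on any device; the failures in this log come from the Triton kernel's compile-time cast to fp8e4nv, which this GPU architecture does not provide.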
2025-05-07T20:32:59.3809809Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> test body identical to the one above; fails at moe/activation_test.py:117 with the same fp8e4nv CompilationError (triton/compiler/compiler.py:100).

2025-05-07T20:32:59.4698555Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False) -> fails earlier, while materializing the inputs at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)):
E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:95: OutOfMemoryError
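Each "Tried to allocate" size matches the bf16 test input exactly: x has shape [T, 2 * D] in bfloat16 (2 bytes per element), and torch.sign / torch.clamp each materialize another tensor of the same size, which is why some examples die on those lines rather than on torch.randn. A quick arithmetic check (plain Python; shapes taken from the examples in this log):

    def alloc_mib(T: int, D: int) -> float:
        # One [T, 2*D] bfloat16 tensor: 2 bytes per element.
        return T * 2 * D * 2 / 2**20

    assert alloc_mib(16384, 5120) == 320.0  # matches the 320.00 MiB failure above
    assert alloc_mib(16384, 7168) == 448.0  # 448.00 MiB failures
    assert alloc_mib(4096, 7168) == 112.0   # 112.00 MiB failures
    assert alloc_mib(4096, 5120) == 80.0    # 80.00 MiB failure
    assert alloc_mib(2048, 7168) == 56.0    # 56.00 MiB failures
    assert alloc_mib(2048, 5120) == 40.0    # 40.00 MiB failures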
2025-05-07T20:32:59.4712968Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB with 28.44 MiB free; 21.61 GiB allocated by PyTorch.
2025-05-07T20:32:59.4726418Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 448.00 MiB with 140.44 MiB free; 21.50 GiB allocated by PyTorch.
2025-05-07T20:32:59.4739104Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB with 28.44 MiB free; 21.67 GiB allocated by PyTorch.
2025-05-07T20:32:59.4752423Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 56.00 MiB with 28.44 MiB free; 21.67 GiB allocated by PyTorch.
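The out-of-memory failures are cumulative rather than per-example: across successive examples the "allocated by PyTorch" figure stays between 21.50 and 21.73 GiB while only tens of MiB remain free, so even a 40 MiB request fails. A sketch of the usual mitigation between examples (standard PyTorch APIs; whether this harness can hook them between Hypothesis examples is an assumption):

    import gc
    import torch

    def release_cuda_memory() -> None:
        # Drop dead Python references left over from the previous example,
        # then return the caching allocator's blocks so the next example
        # starts from a clean pool.
        gc.collect()
        torch.cuda.empty_cache()

    # The error text itself suggests the fragmentation knob; it must be set
    # before the process first initializes CUDA to take effect:
    #     PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True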
2025-05-07T20:32:59.7746157Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> input setup succeeds; fails at moe/activation_test.py:117 with the same fp8e4nv CompilationError (triton/compiler/compiler.py:100).
2025-05-07T20:32:59.7777717Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -> same fp8e4nv CompilationError at moe/activation_test.py:117.
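For reference, the operation under test combines a SiLU gate with rowwise fp8 quantization. The eager sketch below is a hypothetical reconstruction from the call sites visible in this log (silu_mul_quant(x0, x1, scale_ub_tensor) returning (y_fp8, y_scale)), not FBGEMM's implementation; the e4m3 maximum of 448 and the use of scale_ub as a cap on the rowwise maximum are assumptions:

    import torch
    import torch.nn.functional as F

    FP8_MAX = 448.0  # largest finite value representable in float8_e4m3fn

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        # Hypothetical reference: SiLU(x0) * x1, quantized rowwise to fp8.
        y = F.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale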
2025-05-07T20:32:59.8589646Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same fp8e4nv CompilationError at moe/activation_test.py:117.
2025-05-07T20:32:59.8621095Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 56.00 MiB with 26.44 MiB free; 21.69 GiB allocated by PyTorch.
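Every torch.randn failure above is a modest request (40 to 448 MiB) against a nearly full 22.07 GiB device, so one defensive option is to check free memory before building an example's inputs and skip when the device is exhausted. A sketch using torch.cuda.mem_get_info (the skip-instead-of-fail policy is an assumption, not part of this test):

    import pytest
    import torch

    def require_free_cuda_mib(required_mib: float) -> None:
        # torch.cuda.mem_get_info() returns (free_bytes, total_bytes) for the
        # current device; skip the example rather than letting torch.randn
        # raise OutOfMemoryError as in the failures above.
        free_bytes, _total_bytes = torch.cuda.mem_get_info()
        if free_bytes < required_mib * 2**20:
            pytest.skip(f"only {free_bytes / 2**20:.0f} MiB free, "
                        f"need ~{required_mib:.0f} MiB")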
2025-05-07T20:32:59.9576819Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same fp8e4nv CompilationError at moe/activation_test.py:117.
2025-05-07T20:32:59.9608548Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 40.00 MiB with 26.44 MiB free; 21.73 GiB allocated by PyTorch.
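The two memory figures quoted in each OutOfMemoryError map directly onto the allocator's counters: "allocated by PyTorch" is memory held by live tensors, while "reserved by PyTorch but unallocated" is cached pool space that is free but not returned to the driver. A small helper (a sketch) that reports the same counters when debugging an accumulation like the one visible in this run:

    import torch

    def cuda_memory_report() -> str:
        # memory_allocated(): bytes held by live tensors ("allocated by PyTorch").
        # memory_reserved(): bytes held by the caching allocator, including the
        # "reserved but unallocated" slack called out in the error message.
        allocated = torch.cuda.memory_allocated() / 2**30
        reserved = torch.cuda.memory_reserved() / 2**30
        return f"allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB"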
2025-05-07T20:32:59.9621632Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 320.00 MiB with 26.44 MiB free; 21.73 GiB allocated by PyTorch.
2025-05-07T20:33:00.0606776Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92: tried to allocate 80.00 MiB with 26.44 MiB free.
2025-05-07T20:33:00.0619677Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92: tried to allocate 40.00 MiB with 26.44 MiB free.
2025-05-07T20:33:00.0632264Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:92: tried to allocate 112.00 MiB with 26.44 MiB free.
2025-05-07T20:33:00.0644830Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92: tried to allocate 40.00 MiB with 26.44 MiB free.
2025-05-07T20:33:00.0663514Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92: tried to allocate 112.00 MiB with 26.44 MiB free.
2025-05-07T20:33:00.1945419Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True) -> OutOfMemoryError at x = torch.randn(...): tried to allocate 448.00 MiB with 26.44 MiB free; 21.73 GiB allocated by PyTorch.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.1958479Z 2025-05-07T20:33:00.1958603Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.1958824Z 2025-05-07T20:33:00.1958948Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.1959372Z self=, 2025-05-07T20:33:00.1959794Z T=4096, 2025-05-07T20:33:00.1959994Z D=7168, 2025-05-07T20:33:00.1960195Z scale_ub=None, 2025-05-07T20:33:00.1960418Z contiguous=True, 2025-05-07T20:33:00.1960657Z compiled=False, 2025-05-07T20:33:00.1960876Z ) 2025-05-07T20:33:00.1961203Z self = 2025-05-07T20:33:00.1961716Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.1961996Z 2025-05-07T20:33:00.1962086Z @given( 2025-05-07T20:33:00.1962320Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.1962648Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.1962969Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.1963306Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.1963726Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.1964033Z ) 2025-05-07T20:33:00.1964410Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.1964870Z def test_silu_mul_quant( 2025-05-07T20:33:00.1965127Z self, 2025-05-07T20:33:00.1965334Z T: int, 2025-05-07T20:33:00.1965534Z D: int, 2025-05-07T20:33:00.1965761Z scale_ub: Optional[float], 2025-05-07T20:33:00.1966043Z contiguous: bool, 2025-05-07T20:33:00.1966289Z compiled: bool, 2025-05-07T20:33:00.1966523Z ) -> None: 2025-05-07T20:33:00.1966820Z torch.manual_seed(2025) 2025-05-07T20:33:00.1967067Z 2025-05-07T20:33:00.1967354Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.1969542Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.1971483Z 2025-05-07T20:33:00.1971606Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.1971824Z 2025-05-07T20:33:00.1971939Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.1972366Z self=, 2025-05-07T20:33:00.1972787Z T=16384, 2025-05-07T20:33:00.1972993Z D=7168, 2025-05-07T20:33:00.1973185Z scale_ub=None, 2025-05-07T20:33:00.1973412Z contiguous=True, 2025-05-07T20:33:00.1973651Z compiled=False, 2025-05-07T20:33:00.1973858Z ) 2025-05-07T20:33:00.1974195Z self = 2025-05-07T20:33:00.1974710Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.1975070Z 2025-05-07T20:33:00.1975161Z @given( 2025-05-07T20:33:00.1975397Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.1975722Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.1976041Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.1976377Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.1976719Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.1977020Z ) 2025-05-07T20:33:00.1977373Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.1977829Z def test_silu_mul_quant( 2025-05-07T20:33:00.1978154Z self, 2025-05-07T20:33:00.1978354Z T: int, 2025-05-07T20:33:00.1978558Z D: int, 2025-05-07T20:33:00.1978788Z scale_ub: Optional[float], 2025-05-07T20:33:00.1979069Z contiguous: bool, 2025-05-07T20:33:00.1979321Z compiled: bool, 2025-05-07T20:33:00.1979558Z ) -> None: 2025-05-07T20:33:00.1979787Z torch.manual_seed(2025) 2025-05-07T20:33:00.1980032Z 2025-05-07T20:33:00.1980318Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.1982444Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.1984419Z 2025-05-07T20:33:00.1984606Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.1984826Z 2025-05-07T20:33:00.1984932Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.1985363Z self=, 2025-05-07T20:33:00.1985784Z T=16384, 2025-05-07T20:33:00.1985988Z D=7168, 2025-05-07T20:33:00.1986180Z scale_ub=1200.0, 2025-05-07T20:33:00.1986416Z contiguous=True, 2025-05-07T20:33:00.1986650Z compiled=False, 2025-05-07T20:33:00.1986856Z ) 2025-05-07T20:33:00.1987190Z self = 2025-05-07T20:33:00.1987754Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:00.1988037Z 2025-05-07T20:33:00.1988117Z @given( 2025-05-07T20:33:00.1988361Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.1988689Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.1989000Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.1989346Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.1989690Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.1989988Z ) 2025-05-07T20:33:00.1990343Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.1990846Z def test_silu_mul_quant( 2025-05-07T20:33:00.1991099Z self, 2025-05-07T20:33:00.1991297Z T: int, 2025-05-07T20:33:00.1991503Z D: int, 2025-05-07T20:33:00.1991733Z scale_ub: Optional[float], 2025-05-07T20:33:00.1992010Z contiguous: bool, 2025-05-07T20:33:00.1992269Z compiled: bool, 2025-05-07T20:33:00.1992506Z ) -> None: 2025-05-07T20:33:00.1992725Z torch.manual_seed(2025) 2025-05-07T20:33:00.1992979Z 2025-05-07T20:33:00.1993266Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.1995446Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
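The reported allocation sizes match the input tensor the test builds at line 92: x has shape [T, 2 * D] in bfloat16, i.e. 2 bytes per element. A quick arithmetic check (a standalone sketch, not part of the test file):

```python
# Sketch: the "Tried to allocate" sizes are exactly the bf16 input tensor
# x = torch.randn([T, 2 * D], dtype=torch.bfloat16), at 2 bytes per element.
def x_size_mib(T: int, D: int) -> float:
    return T * (2 * D) * 2 / (1024 ** 2)

assert x_size_mib(2048, 5120) == 40.0    # "Tried to allocate 40.00 MiB"
assert x_size_mib(4096, 7168) == 112.0   # "Tried to allocate 112.00 MiB"
assert x_size_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"
```

So the per-example allocations are modest; the failures come from the 21.73 GiB that is already held when these examples run.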
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

This example is small enough to allocate, so the test runs past line 92, builds x0, x1, and scale_ub_tensor, and fails when the Triton kernel behind silu_mul_quant is compiled:

> y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
            module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
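This ValueError is raised by Triton, not by the test: fp8e4nv is Triton's name for the e4m3 FP8 format, and the GPU running this job supports only the 'fp8e4b15' and 'fp8e5' encodings. On such hardware the usual remedy is to gate fp8 tests on compute capability. A minimal sketch, assuming the common cutoff of SM 8.9 (Ada/Hopper) for e4m3 support; neither the helper nor the cutoff comes from the FBGEMM sources:

```python
# Sketch: skip fp8 tests on GPUs whose architecture lacks fp8e4nv (e4m3).
import unittest

import torch

def device_supports_fp8e4nv() -> bool:
    # Assumption: Triton lowers fp8e4nv only on compute capability >= 8.9.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Usage: apply to the test class or to individual fp8 tests.
skip_unless_fp8 = unittest.skipUnless(
    device_supports_fp8e4nv(), "fp8e4nv (e4m3) not supported on this GPU"
)
```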
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: Tried to allocate 56.00 MiB (26.44 MiB free; 21.74 GiB allocated by PyTorch, 10.99 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)

With compiled=True the call is routed through torch.compile (torch/_dynamo/eval_frame.py:678: in _fn) before reaching the same Triton kernel, and it fails the same way:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

The next three examples run out of memory again, and by now only 4.44 MiB of GPU 0's 22.07 GiB is free (this process has 22.06 GiB in use, 21.77 GiB of it allocated by PyTorch):

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): Tried to allocate 20.00 MiB
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95: Tried to allocate 20.00 MiB
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:92: Tried to allocate 20.00 MiB
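Note the trend: the first round of failures reported 26.44 MiB free, while the examples above report 4.44 MiB free, so memory consumed by earlier examples is evidently not being released between Hypothesis examples. A defensive pattern for tests like this (a sketch of a generic mitigation, not FBGEMM's actual fix) is to drop dead references and return cached blocks between runs:

```python
# Sketch: release cached CUDA memory between test runs.
import gc
import unittest

import torch

class ActivationTestsWithCleanup(unittest.TestCase):
    def tearDown(self) -> None:
        gc.collect()              # drop dead Python references to tensors
        torch.cuda.synchronize()  # wait for in-flight kernels
        torch.cuda.empty_cache()  # return cached allocator blocks
```

One caveat: tearDown runs once per test method, while Hypothesis runs many examples inside a single method, so per-example cleanup would have to happen in the test body itself.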
FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
+ Exception Group Traceback (most recent call last):
  | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
  |   yield
  | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 591, in run
  |   self._callTestMethod(testMethod)
  | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
  |   method()
  | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |   T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |   raise the_error_hypothesis_found
  | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)): CUDA out of memory. Tried to allocate 40.00 MiB; 4.44 MiB free.
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=2048,
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=False,  # or any other generated value
    | )
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case
    +---------------- 2 ----------------
    | torch.OutOfMemoryError at moe/activation_test.py:92: CUDA out of memory. Tried to allocate 20.00 MiB; 4.44 MiB free.
    | Falsifying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
    +---------------- 3 ----------------
    | torch.OutOfMemoryError at moe/activation_test.py:92: CUDA out of memory. Tried to allocate 20.00 MiB; 4.44 MiB free.
    | Falsifying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
    +---------------- 4 ----------------
    | triton.compiler.errors.CompilationError, raised from the reference path: moe/activation_test.py:126 (y_fp8_ref, y_scale_ref = ref_fn()) -> moe/activation_test.py:124 (return triton_quantize_fp8_row(y, scale_ub_tensor)) -> fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370 (_kernel_quantize_fp8_row[grid]) -> Triton autotuner -> jit -> compile:
    |   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
    +------------------------------------
---------------------------------- Hypothesis ----------------------------------
The captured Hypothesis output replays the verbose "Trying example" listings; the test source printed with each one is identical, so only the failing call and error differ.

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)

fn() completes here; the failure is in the reference path:

> y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
(through triton/runtime/autotuner.py:186/166, triton/testing.py:117 do_bench, triton/runtime/jit.py:623 run, triton/compiler/compiler.py:273 compile)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
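Sub-exception 4 is informative: the fused silu_mul_quant kernel is not even needed to hit the error, because the test's own reference path (ref_fn, shown in the listings above) dies in the same fp8 cast inside _kernel_quantize_fp8_row. The reference math itself is plain PyTorch: SiLU(x0) * x1 in fp32, then rowwise FP8 quantization. A sketch of an unfused equivalent, under the assumption that rowwise quantization maps each row's absmax to the fp8 maximum (FBGEMM's exact scaling rules may differ):

```python
# Sketch: unfused reference for the silu-mul-quant pattern in plain PyTorch.
from typing import Optional, Tuple

import torch

def silu_mul_quant_ref(
    x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Same math as ref_fn in the test: x0 * sigmoid(x0) * x1, in fp32.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    # Assumed rowwise scheme: scale each row so its absmax hits the fp8 max.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    row_amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_amax = torch.minimum(row_amax, scale_ub)
    y_scale = row_amax / fp8_max
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale.squeeze(-1)
```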
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)

> y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
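Each distinct failure above came with a @reproduce_failure blob; applying one pins Hypothesis to replay exactly that example, which is the fastest way to debug these locally. A sketch for failure 1 (decorator and blob copied from the log; the body is abbreviated, and the blob is only valid against Hypothesis 6.131.14):

```python
# Sketch: deterministically replay Hypothesis failure 1 from this log.
from hypothesis import Verbosity, given, reproduce_failure, settings
from hypothesis import strategies as st

@reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')  # blob printed above
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@settings(verbosity=Verbosity.verbose, deadline=None)
def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
    ...  # unchanged test body
```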
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)

Here fn() completes and the reference path fails instead:

> y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
2025-05-07T20:33:00.6231669Z op = torch.compile(op)
2025-05-07T20:33:00.6231972Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:00.6232253Z
2025-05-07T20:33:00.6232449Z > y_fp8, y_scale = fn()
2025-05-07T20:33:00.6232614Z
2025-05-07T20:33:00.6232717Z moe/activation_test.py:117:
2025-05-07T20:33:00.6246490Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6246843Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:00.6247104Z E ^
2025-05-07T20:33:00.6247578Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6248036Z
2025-05-07T20:33:00.6248461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6249024Z
2025-05-07T20:33:00.6249131Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:00.6249554Z self=,
2025-05-07T20:33:00.6249958Z T=1,
2025-05-07T20:33:00.6250137Z D=7168,
2025-05-07T20:33:00.6250334Z scale_ub=None,
2025-05-07T20:33:00.6250554Z contiguous=True,
2025-05-07T20:33:00.6250782Z compiled=True,
2025-05-07T20:33:00.6250993Z )
2025-05-07T20:33:00.6251317Z self =
2025-05-07T20:33:00.6251808Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True
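Every failing example above reduces to the same root cause: Triton's fp8e4nv is CUDA's E4M3 format, which Triton only compiles for GPUs of compute capability 8.9 or newer (e.g. L4, H100), while the A10G in a linux.g5.4xlarge runner reports capability 8.6 and therefore only offers fp8e4b15 and fp8e5. Both kernels fail inside make_ir before anything launches. A minimal capability guard along these lines would skip the test rather than fail it; supports_fp8e4nv and requires_fp8e4nv are illustrative names, not part of moe/activation_test.py:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (CUDA E4M3) needs SM 8.9+;
        # the A10G on this runner reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Illustrative marker, not defined in the FBGEMM test file:
    requires_fp8e4nv = pytest.mark.skipif(
        not supports_fp8e4nv(), reason="fp8e4nv requires SM 8.9+"
    )

Applied as a decorator on test_silu_mul_quant, this would report the whole Hypothesis run as skipped on this hardware instead of erroring on every drawn example.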
2025-05-07T20:33:00.6266785Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:00.6266982Z
2025-05-07T20:33:00.6267088Z moe/activation_test.py:126:
2025-05-07T20:33:00.6287501Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6287864Z E def _kernel_quantize_fp8_row(
2025-05-07T20:33:00.6288135Z E ^
2025-05-07T20:33:00.6288597Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6289101Z
2025-05-07T20:33:00.6289526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6290048Z
2025-05-07T20:33:00.6290152Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:00.6290572Z self=,
2025-05-07T20:33:00.6290969Z T=4096,
2025-05-07T20:33:00.6291164Z D=5120,
2025-05-07T20:33:00.6291364Z scale_ub=None,
2025-05-07T20:33:00.6291578Z contiguous=False,
2025-05-07T20:33:00.6291809Z compiled=False,
2025-05-07T20:33:00.6292016Z )
2025-05-07T20:33:00.6300207Z self =
2025-05-07T20:33:00.6300762Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:33:00.6312696Z > y_fp8, y_scale = fn()
2025-05-07T20:33:00.6312863Z
2025-05-07T20:33:00.6312972Z moe/activation_test.py:117:
2025-05-07T20:33:00.6322924Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6323026Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:00.6323106Z E ^
2025-05-07T20:33:00.6323469Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6323474Z
2025-05-07T20:33:00.6323896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6323941Z
2025-05-07T20:33:00.6324053Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:00.6324279Z self=,
2025-05-07T20:33:00.6324361Z T=4096,
2025-05-07T20:33:00.6324437Z D=7168,
2025-05-07T20:33:00.6324520Z scale_ub=None,
2025-05-07T20:33:00.6324616Z contiguous=False,
2025-05-07T20:33:00.6324700Z compiled=False,
2025-05-07T20:33:00.6324774Z )
2025-05-07T20:33:00.6325037Z self =
2025-05-07T20:33:00.6325268Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:33:00.6329632Z > y_fp8, y_scale = fn()
2025-05-07T20:33:00.6329637Z
2025-05-07T20:33:00.6329733Z moe/activation_test.py:117:
2025-05-07T20:33:00.6335879Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6335990Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:00.6336064Z E ^
2025-05-07T20:33:00.6336427Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6336437Z
2025-05-07T20:33:00.6336860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6336865Z
2025-05-07T20:33:00.6336968Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:00.6337195Z self=,
2025-05-07T20:33:00.6337273Z T=128,
2025-05-07T20:33:00.6337348Z D=7168,
2025-05-07T20:33:00.6337436Z scale_ub=None,
2025-05-07T20:33:00.6337521Z contiguous=False,
2025-05-07T20:33:00.6337603Z compiled=True,
2025-05-07T20:33:00.6337678Z )
2025-05-07T20:33:00.6337898Z self =
2025-05-07T20:33:00.6338162Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True
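For context, the ref_fn that fails to compile here is just SiLU(x0) * x1 followed by row-wise fp8 quantization. A hedged eager-PyTorch sketch of that computation, assuming E4M3's maximum representable value of 448 and PyTorch's torch.float8_e4m3fn dtype (the helper name is illustrative, and the exact epsilon and clamping details inside triton_quantize_fp8_row may differ):

    import torch

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        # SiLU(x0) * x1 in fp32, matching ref_fn above
        x0f, x1f = x0.float(), x1.float()
        y = x0f * torch.sigmoid(x0f) * x1f
        # Row-wise scale: max |y| per row, optionally clamped to scale_ub
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = row_max.clamp(max=scale_ub)
        scale = row_max.clamp(min=1e-12) / 448.0  # 448 = E4M3 max (assumed)
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Dequantization then recovers y the same way the test does: y = y_fp8.to(torch.float32) * y_scale[:, None].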
2025-05-07T20:33:00.6343642Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:00.6343647Z
2025-05-07T20:33:00.6343744Z moe/activation_test.py:126:
2025-05-07T20:33:00.6352699Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6352802Z E def _kernel_quantize_fp8_row(
2025-05-07T20:33:00.6352881Z E ^
2025-05-07T20:33:00.6353242Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6353246Z
2025-05-07T20:33:00.6353674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6353679Z
2025-05-07T20:33:00.6353784Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:00.6354011Z self=,
2025-05-07T20:33:00.6354090Z T=128,
2025-05-07T20:33:00.6354165Z D=7168,
2025-05-07T20:33:00.6354244Z scale_ub=None,
2025-05-07T20:33:00.6354334Z contiguous=False,
2025-05-07T20:33:00.6354417Z compiled=False,
2025-05-07T20:33:00.6354486Z )
2025-05-07T20:33:00.6354712Z self =
2025-05-07T20:33:00.6354910Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:33:00.6359792Z > y_fp8, y_scale = fn()
2025-05-07T20:33:00.6359797Z
2025-05-07T20:33:00.6359893Z moe/activation_test.py:117:
2025-05-07T20:33:00.6365953Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6366054Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:00.6366131Z E ^
2025-05-07T20:33:00.6366496Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6366542Z
2025-05-07T20:33:00.6366969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6366974Z
2025-05-07T20:33:00.6367085Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:00.6367311Z self=,
2025-05-07T20:33:00.6367389Z T=4096,
2025-05-07T20:33:00.6367473Z D=5120,
2025-05-07T20:33:00.6367557Z scale_ub=1200.0,
2025-05-07T20:33:00.6367642Z contiguous=True,
2025-05-07T20:33:00.6367727Z compiled=False,
2025-05-07T20:33:00.6367801Z )
2025-05-07T20:33:00.6368034Z self =
2025-05-07T20:33:00.6368211Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:33:00.6372700Z > y_fp8, y_scale = fn()
2025-05-07T20:33:00.6372704Z
2025-05-07T20:33:00.6372802Z moe/activation_test.py:117:
2025-05-07T20:33:00.6378890Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6378995Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:00.6379071Z E ^
2025-05-07T20:33:00.6379430Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6379437Z
2025-05-07T20:33:00.6379865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6379870Z
2025-05-07T20:33:00.6379974Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:00.6380206Z self=,
2025-05-07T20:33:00.6380327Z T=1,
2025-05-07T20:33:00.6380404Z D=5120,
2025-05-07T20:33:00.6380493Z scale_ub=None,
2025-05-07T20:33:00.6380579Z contiguous=True,
2025-05-07T20:33:00.6380664Z compiled=True,
2025-05-07T20:33:00.6380741Z )
2025-05-07T20:33:00.6380964Z self =
2025-05-07T20:33:00.6381126Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
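The failure is independent of T, D, scale_ub, contiguous, and compiled: the @given grid has 5 x 2 x 2 x 2 x 2 = 80 possible combinations, and every sampled one hits the same compile-time check in make_ir. A standalone repro sketch (an assumed script, not part of the suite; the Triton dtype-mapping details are as I understand them) that should trigger the identical ValueError on a pre-SM-8.9 GPU with a trivial kernel:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8e4nv(X, Y, BLOCK: tl.constexpr):
        offs = tl.arange(0, BLOCK)
        x = tl.load(X + offs)
        # The fp8e4nv cast is what trips the architecture check at compile time
        tl.store(Y + offs, x.to(tl.float8e4nv))

    x = torch.randn(128, device="cuda", dtype=torch.float32)
    y = torch.empty(128, device="cuda", dtype=torch.float8_e4m3fn)
    # On an A10G (SM 8.6) this raises CompilationError wrapping the same
    # ValueError; on SM 8.9+ it compiles and runs.
    _cast_fp8e4nv[(1,)](x, y, BLOCK=128)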
y_scale_ref = ref_fn() 2025-05-07T20:33:00.6386633Z 2025-05-07T20:33:00.6386737Z moe/activation_test.py:126: 2025-05-07T20:33:00.6386869Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6386984Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:00.6387121Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.6387738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:00.6387851Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:00.6388218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6388440Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6388822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:00.6389118Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.6389528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:00.6389788Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.6390172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:00.6390349Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:00.6390737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:00.6390822Z fn() 2025-05-07T20:33:00.6391232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:00.6391313Z self.fn.run( 2025-05-07T20:33:00.6391665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6391758Z kernel = self.compile( 2025-05-07T20:33:00.6392143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6392329Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6392454Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6392499Z 2025-05-07T20:33:00.6392712Z self = 2025-05-07T20:33:00.6393507Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6394022Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd6dba2f250>} 2025-05-07T20:33:00.6394789Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6394988Z context = 2025-05-07T20:33:00.6394993Z 2025-05-07T20:33:00.6395193Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6395487Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6395599Z module_map=module_map) 2025-05-07T20:33:00.6395763Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6395867Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:00.6395947Z E ^ 2025-05-07T20:33:00.6396307Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6396312Z 2025-05-07T20:33:00.6396734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6396744Z 2025-05-07T20:33:00.6396847Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6397113Z self=, 2025-05-07T20:33:00.6397194Z T=2048, 2025-05-07T20:33:00.6397272Z D=5120, 2025-05-07T20:33:00.6397355Z scale_ub=None, 2025-05-07T20:33:00.6397445Z contiguous=True, 2025-05-07T20:33:00.6397534Z compiled=True, 2025-05-07T20:33:00.6397606Z ) 2025-05-07T20:33:00.6397833Z self = 2025-05-07T20:33:00.6398002Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:00.6398007Z 2025-05-07T20:33:00.6398125Z @given( 2025-05-07T20:33:00.6398252Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6398351Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6398471Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6398589Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6398705Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6398783Z ) 2025-05-07T20:33:00.6399031Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6399127Z def test_silu_mul_quant( 2025-05-07T20:33:00.6399205Z self, 2025-05-07T20:33:00.6399321Z T: int, 2025-05-07T20:33:00.6399399Z D: int, 2025-05-07T20:33:00.6399501Z scale_ub: Optional[float], 2025-05-07T20:33:00.6399589Z contiguous: bool, 2025-05-07T20:33:00.6399678Z compiled: bool, 2025-05-07T20:33:00.6399756Z ) -> None: 2025-05-07T20:33:00.6399852Z torch.manual_seed(2025) 2025-05-07T20:33:00.6399929Z 2025-05-07T20:33:00.6400101Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6400175Z 2025-05-07T20:33:00.6400271Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6400400Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6400489Z x = x_sign * x_clamp 2025-05-07T20:33:00.6400577Z x0 = x[:, :D] 2025-05-07T20:33:00.6400661Z x1 = x[:, D:] 2025-05-07T20:33:00.6400732Z 2025-05-07T20:33:00.6400817Z if contiguous: 2025-05-07T20:33:00.6400975Z x0 = x0.contiguous() 2025-05-07T20:33:00.6401068Z x1 = x1.contiguous() 2025-05-07T20:33:00.6401145Z 2025-05-07T20:33:00.6401235Z if scale_ub is not None: 2025-05-07T20:33:00.6401346Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6401479Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6401553Z ) 2025-05-07T20:33:00.6401636Z else: 2025-05-07T20:33:00.6401729Z scale_ub_tensor = None 2025-05-07T20:33:00.6401801Z 2025-05-07T20:33:00.6401936Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6402023Z op = silu_mul_quant 2025-05-07T20:33:00.6402107Z if compiled: 
2025-05-07T20:33:00.6402211Z op = torch.compile(op) 2025-05-07T20:33:00.6402317Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6402389Z 2025-05-07T20:33:00.6402488Z y_fp8, y_scale = fn() 2025-05-07T20:33:00.6402611Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:00.6402692Z 2025-05-07T20:33:00.6402831Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6402936Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:00.6403037Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:00.6403164Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:00.6403305Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.6403383Z 2025-05-07T20:33:00.6403484Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:00.6403488Z 2025-05-07T20:33:00.6403587Z moe/activation_test.py:126: 2025-05-07T20:33:00.6403716Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6403866Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:00.6404005Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.6404577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:00.6404683Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:00.6405052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6405276Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6405698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:00.6405955Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.6406358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:00.6406620Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.6406999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:00.6407209Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:00.6407564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:00.6407641Z fn() 2025-05-07T20:33:00.6408054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:00.6408141Z self.fn.run( 2025-05-07T20:33:00.6408485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6408583Z kernel = self.compile( 2025-05-07T20:33:00.6408974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6409151Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6409323Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6409327Z 2025-05-07T20:33:00.6409540Z self = 2025-05-07T20:33:00.6410338Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
2025-05-07T20:33:00.6410854Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6db5d3760>}
2025-05-07T20:33:00.6411661Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:00.6411933Z context = 
2025-05-07T20:33:00.6412173Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:00.6412455Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:00.6412569Z module_map=module_map)
2025-05-07T20:33:00.6412743Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6412852Z E def _kernel_quantize_fp8_row(
2025-05-07T20:33:00.6412932Z E ^
2025-05-07T20:33:00.6413299Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6413782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6413903Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:00.6420913Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:00.6421022Z moe/activation_test.py:126:
2025-05-07T20:33:00.6439741Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6439850Z E def _kernel_quantize_fp8_row(
2025-05-07T20:33:00.6439935Z E ^
2025-05-07T20:33:00.6440301Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6440779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6440898Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:00.6447658Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:00.6447816Z moe/activation_test.py:126:
2025-05-07T20:33:00.6482718Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6482819Z E def _kernel_quantize_fp8_row(
2025-05-07T20:33:00.6482892Z E ^
2025-05-07T20:33:00.6483322Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6483749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6483861Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:00.6490434Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:00.6490536Z moe/activation_test.py:126:
2025-05-07T20:33:00.6499727Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6499831Z E def _kernel_quantize_fp8_row(
2025-05-07T20:33:00.6499968Z E ^
2025-05-07T20:33:00.6500333Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6500772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6500883Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:00.6506602Z > y_fp8, y_scale = fn()
2025-05-07T20:33:00.6506708Z moe/activation_test.py:117:
2025-05-07T20:33:00.6513307Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6513415Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:00.6513492Z E ^
2025-05-07T20:33:00.6513853Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6514289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6514399Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:00.6521175Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:00.6521286Z moe/activation_test.py:126:
2025-05-07T20:33:00.6530243Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6530348Z E def _kernel_quantize_fp8_row(
2025-05-07T20:33:00.6530435Z E ^
2025-05-07T20:33:00.6530790Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6531213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6531335Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:00.6537050Z > y_fp8, y_scale = fn()
2025-05-07T20:33:00.6537153Z moe/activation_test.py:117:
2025-05-07T20:33:00.6543299Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6543400Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:00.6543478Z E ^
2025-05-07T20:33:00.6543847Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6544313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6544433Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:00.6550134Z > y_fp8, y_scale = fn()
2025-05-07T20:33:00.6550279Z moe/activation_test.py:117:
2025-05-07T20:33:00.6557050Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6557151Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:00.6557237Z E ^
2025-05-07T20:33:00.6557596Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6558025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6558144Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:00.6563922Z > y_fp8, y_scale = fn()
2025-05-07T20:33:00.6564024Z moe/activation_test.py:117:
2025-05-07T20:33:00.6570077Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6570182Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:00.6570261Z E ^
2025-05-07T20:33:00.6570628Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6571053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6571205Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:00.6576882Z > y_fp8, y_scale = fn()
2025-05-07T20:33:00.6576989Z moe/activation_test.py:117:
2025-05-07T20:33:00.6583023Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6583122Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:00.6583207Z E ^
2025-05-07T20:33:00.6583568Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6583998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
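Every example above fails at Triton compile time, before any numerics run: Triton only lowers the fp8e4nv (FP8 e4m3) dtype on GPUs of compute capability 8.9 (Ada) or newer, and on older architectures it offers only 'fp8e4b15' and 'fp8e5', exactly as the error text says. A minimal sketch, outside the log and assuming only torch and the standard library, of the kind of capability guard that would skip these tests on such GPUs (the helper name and skip message are illustrative, not FBGEMM's actual API):

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Assumption: Triton's fp8e4nv (e4m3) lowering needs sm_89 (Ada) or newer;
    # earlier GPUs raise the ValueError seen throughout this log.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(supports_fp8e4nv(), "Triton fp8e4nv unsupported on this GPU")
class ActivationTest(unittest.TestCase):
    ...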
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6583573Z
2025-05-07T20:33:00.6583998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
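Every example Hypothesis generates for test_silu_mul_quant fails at Triton compile time, before the kernel ever runs: _fbgemm_silu_mul_quant stores its result as fp8e4nv (torch.float8_e4m3fn), and Triton lowers that dtype only on NVIDIA GPUs with compute capability 8.9 or newer, while the GPU on this runner exposes only fp8e4b15 and fp8e5, exactly as the ValueError reports. Since the test runs with @settings(verbosity=Verbosity.verbose), Hypothesis logs each generated example, and every example below hits the same CompilationError. A minimal sketch of a capability guard that would skip, rather than fail, these tests on unsupported GPUs; supports_fp8e4nv is a hypothetical helper, not part of moe/activation_test.py:

    import unittest

    import torch

    # Hypothetical guard (assumption: Triton lowers fp8e4nv only on GPUs
    # with compute capability >= (8, 9), e.g. Ada or Hopper; older parts
    # such as sm_80/sm_86 expose only fp8e4b15 and fp8e5).
    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Applied to the test method, this would skip instead of failing every
    # generated example:
    #
    #     @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    #     def test_silu_mul_quant(self, ...) -> None: ...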
2025-05-07T20:33:00.6584107Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:00.6600638Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6600736Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6600810Z E ^ 2025-05-07T20:33:00.6601171Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6601598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6601710Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:00.6613754Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6613854Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6613930Z E ^ 2025-05-07T20:33:00.6614297Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6614763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6614875Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:00.6627012Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6627110Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6627189Z E ^ 2025-05-07T20:33:00.6627549Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6627554Z 2025-05-07T20:33:00.6627973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6627978Z 2025-05-07T20:33:00.6628132Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6628358Z self=, 2025-05-07T20:33:00.6628434Z T=1, 2025-05-07T20:33:00.6628509Z D=7168, 2025-05-07T20:33:00.6628587Z scale_ub=None, 2025-05-07T20:33:00.6628677Z contiguous=False, 2025-05-07T20:33:00.6628756Z compiled=True, 2025-05-07T20:33:00.6628824Z ) 2025-05-07T20:33:00.6629044Z self = 2025-05-07T20:33:00.6629209Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:00.6629214Z 2025-05-07T20:33:00.6629328Z @given( 2025-05-07T20:33:00.6629452Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6629547Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6629663Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6629780Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6629894Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6629969Z ) 2025-05-07T20:33:00.6630214Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6630306Z def test_silu_mul_quant( 2025-05-07T20:33:00.6630382Z self, 2025-05-07T20:33:00.6630493Z T: int, 2025-05-07T20:33:00.6630566Z D: int, 2025-05-07T20:33:00.6630667Z scale_ub: Optional[float], 2025-05-07T20:33:00.6630755Z contiguous: bool, 2025-05-07T20:33:00.6630837Z compiled: bool, 2025-05-07T20:33:00.6630913Z ) -> None: 2025-05-07T20:33:00.6631004Z torch.manual_seed(2025) 2025-05-07T20:33:00.6631083Z 2025-05-07T20:33:00.6631251Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6631322Z 2025-05-07T20:33:00.6631415Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6631538Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6631623Z x = x_sign * x_clamp 2025-05-07T20:33:00.6631705Z x0 = x[:, :D] 2025-05-07T20:33:00.6631782Z x1 = x[:, D:] 2025-05-07T20:33:00.6631850Z 2025-05-07T20:33:00.6631934Z if contiguous: 2025-05-07T20:33:00.6632067Z x0 = x0.contiguous() 2025-05-07T20:33:00.6632155Z x1 = x1.contiguous() 2025-05-07T20:33:00.6632228Z 2025-05-07T20:33:00.6632315Z if scale_ub is not None: 2025-05-07T20:33:00.6632418Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6632555Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6632628Z ) 2025-05-07T20:33:00.6632705Z else: 2025-05-07T20:33:00.6632798Z scale_ub_tensor = None 2025-05-07T20:33:00.6632866Z 2025-05-07T20:33:00.6632998Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6633087Z op = silu_mul_quant 2025-05-07T20:33:00.6633167Z if compiled: 2025-05-07T20:33:00.6633265Z op = torch.compile(op) 2025-05-07T20:33:00.6633373Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6633441Z 2025-05-07T20:33:00.6633531Z y_fp8, y_scale = fn() 2025-05-07T20:33:00.6633653Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:00.6633722Z 2025-05-07T20:33:00.6633861Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6633960Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:00.6634063Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:00.6634183Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:00.6634323Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.6634395Z 2025-05-07T20:33:00.6634491Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:00.6634496Z 2025-05-07T20:33:00.6634593Z moe/activation_test.py:126: 2025-05-07T20:33:00.6634721Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6634869Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:00.6635007Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.6635626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:00.6635729Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:00.6636094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6636315Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6636728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:00.6636986Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.6637389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:00.6637644Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.6638020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:00.6638226Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:00.6638576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:00.6638651Z fn() 2025-05-07T20:33:00.6639054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:00.6639136Z self.fn.run( 2025-05-07T20:33:00.6639476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6639570Z kernel = self.compile( 2025-05-07T20:33:00.6639952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6640127Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6640295Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6640302Z 2025-05-07T20:33:00.6640507Z self = 2025-05-07T20:33:00.6641295Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6641805Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd6da3aaef0>} 2025-05-07T20:33:00.6642566Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6642759Z context = 2025-05-07T20:33:00.6642765Z 2025-05-07T20:33:00.6642933Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6643201Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6643306Z module_map=module_map) 2025-05-07T20:33:00.6643466Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6643572Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:00.6643647Z E ^ 2025-05-07T20:33:00.6644008Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6644013Z 2025-05-07T20:33:00.6645003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
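Unlike the neighboring examples, this one fails in the test's reference path rather than in the op under test: ref_fn calls triton_quantize_fp8_row, whose autotuner benchmarks and compiles _kernel_quantize_fp8_row, and that kernel also materializes fp8e4nv output, so the same architecture check fires from inside do_bench. For orientation, a rough pure-PyTorch sketch of the rowwise fp8 quantization being exercised here; the scale semantics (per-row scale = row_max / fp8_max, optionally capped by scale_ub) are an assumption chosen to match the test's dequantization step y = y_fp8.to(torch.float32) * y_scale[:, None], not a transcription of FBGEMM's kernel:

    from typing import Optional, Tuple

    import torch

    # Illustrative only; FBGEMM's _kernel_quantize_fp8_row may differ in
    # details such as eps handling and saturation behavior.
    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = torch.clamp(row_max, min=1e-12) / fp8_max
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale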
2025-05-07T20:33:00.6645115Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:00.6657482Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6657679Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6657756Z E ^ 2025-05-07T20:33:00.6658174Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6658599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6658707Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:00.6670243Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6670339Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6670412Z E ^ 2025-05-07T20:33:00.6670818Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6671239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6671345Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:00.6683269Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6683367Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6683445Z E ^ 2025-05-07T20:33:00.6683802Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6684229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6684338Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:00.6696331Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6696429Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6696515Z E ^ 2025-05-07T20:33:00.6696873Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6697302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6697416Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:00.6709118Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6709217Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6709293Z E ^ 2025-05-07T20:33:00.6709698Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6710131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6710246Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:00.6725991Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6726093Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6726168Z E ^ 2025-05-07T20:33:00.6726527Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6726961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6727072Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:00.6739141Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6739241Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6739313Z E ^ 2025-05-07T20:33:00.6739674Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6739679Z 2025-05-07T20:33:00.6740107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6740114Z 2025-05-07T20:33:00.6740216Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6740483Z self=, 2025-05-07T20:33:00.6740559Z T=4096, 2025-05-07T20:33:00.6740631Z D=5120, 2025-05-07T20:33:00.6740714Z scale_ub=None, 2025-05-07T20:33:00.6740797Z contiguous=False, 2025-05-07T20:33:00.6740875Z compiled=True, 2025-05-07T20:33:00.6740950Z ) 2025-05-07T20:33:00.6741168Z self = 2025-05-07T20:33:00.6741341Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:00.6741346Z 2025-05-07T20:33:00.6741421Z @given( 2025-05-07T20:33:00.6741535Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6741632Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6741747Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6741905Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6742017Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6742091Z ) 2025-05-07T20:33:00.6742337Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6742430Z def test_silu_mul_quant( 2025-05-07T20:33:00.6742503Z self, 2025-05-07T20:33:00.6742575Z T: int, 2025-05-07T20:33:00.6742652Z D: int, 2025-05-07T20:33:00.6742747Z scale_ub: Optional[float], 2025-05-07T20:33:00.6742832Z contiguous: bool, 2025-05-07T20:33:00.6742916Z compiled: bool, 2025-05-07T20:33:00.6742990Z ) -> None: 2025-05-07T20:33:00.6743082Z torch.manual_seed(2025) 2025-05-07T20:33:00.6743155Z 2025-05-07T20:33:00.6743324Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6743398Z 2025-05-07T20:33:00.6743486Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6743608Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6743698Z x = x_sign * x_clamp 2025-05-07T20:33:00.6743777Z x0 = x[:, :D] 2025-05-07T20:33:00.6743853Z x1 = x[:, D:] 2025-05-07T20:33:00.6743924Z 2025-05-07T20:33:00.6744005Z if contiguous: 2025-05-07T20:33:00.6744092Z x0 = x0.contiguous() 2025-05-07T20:33:00.6744181Z x1 = x1.contiguous() 2025-05-07T20:33:00.6744250Z 2025-05-07T20:33:00.6744339Z if scale_ub is not None: 2025-05-07T20:33:00.6744443Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6744575Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6744651Z ) 2025-05-07T20:33:00.6744723Z else: 2025-05-07T20:33:00.6744813Z scale_ub_tensor = None 2025-05-07T20:33:00.6744885Z 2025-05-07T20:33:00.6745117Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6745220Z op = silu_mul_quant 2025-05-07T20:33:00.6745306Z if compiled: 2025-05-07T20:33:00.6745404Z op = torch.compile(op) 2025-05-07T20:33:00.6745510Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6745584Z 2025-05-07T20:33:00.6745672Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6745676Z 2025-05-07T20:33:00.6745772Z moe/activation_test.py:117: 2025-05-07T20:33:00.6745900Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6746040Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6746140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6746515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.6746605Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.6747113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6747210Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6747611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6747839Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6748188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6748286Z kernel = self.compile( 2025-05-07T20:33:00.6748677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6748853Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6748979Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6748983Z 2025-05-07T20:33:00.6749192Z self = 2025-05-07T20:33:00.6749989Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6750539Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a4280>} 2025-05-07T20:33:00.6751304Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6751502Z context = 2025-05-07T20:33:00.6751506Z 2025-05-07T20:33:00.6751672Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6751946Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6752051Z module_map=module_map) 2025-05-07T20:33:00.6752214Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6752318Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6752392Z E ^ 2025-05-07T20:33:00.6752752Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6752757Z 2025-05-07T20:33:00.6753177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6753184Z 2025-05-07T20:33:00.6753286Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6753514Z self=, 2025-05-07T20:33:00.6753590Z T=4096, 2025-05-07T20:33:00.6753666Z D=5120, 2025-05-07T20:33:00.6753790Z scale_ub=1200.0, 2025-05-07T20:33:00.6753878Z contiguous=False, 2025-05-07T20:33:00.6753964Z compiled=False, 2025-05-07T20:33:00.6754037Z ) 2025-05-07T20:33:00.6754256Z self = 2025-05-07T20:33:00.6754436Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:00.6754441Z 2025-05-07T20:33:00.6754517Z @given( 2025-05-07T20:33:00.6754633Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6754734Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6754889Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6755034Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6755163Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6755243Z ) 2025-05-07T20:33:00.6755493Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6755781Z def test_silu_mul_quant( 2025-05-07T20:33:00.6755899Z self, 2025-05-07T20:33:00.6756017Z T: int, 2025-05-07T20:33:00.6756103Z D: int, 2025-05-07T20:33:00.6756204Z scale_ub: Optional[float], 2025-05-07T20:33:00.6756293Z contiguous: bool, 2025-05-07T20:33:00.6756467Z compiled: bool, 2025-05-07T20:33:00.6756545Z ) -> None: 2025-05-07T20:33:00.6756641Z torch.manual_seed(2025) 2025-05-07T20:33:00.6756711Z 2025-05-07T20:33:00.6756881Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6756956Z 2025-05-07T20:33:00.6757051Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6757177Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6757262Z x = x_sign * x_clamp 2025-05-07T20:33:00.6757341Z x0 = x[:, :D] 2025-05-07T20:33:00.6757421Z x1 = x[:, D:] 2025-05-07T20:33:00.6757491Z 2025-05-07T20:33:00.6757572Z if contiguous: 2025-05-07T20:33:00.6757668Z x0 = x0.contiguous() 2025-05-07T20:33:00.6757755Z x1 = x1.contiguous() 2025-05-07T20:33:00.6757827Z 2025-05-07T20:33:00.6757988Z if scale_ub is not None: 2025-05-07T20:33:00.6758093Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6758228Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6758305Z ) 2025-05-07T20:33:00.6758380Z else: 2025-05-07T20:33:00.6758474Z scale_ub_tensor = None 2025-05-07T20:33:00.6758543Z 2025-05-07T20:33:00.6758674Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6758765Z op = silu_mul_quant 2025-05-07T20:33:00.6758847Z if compiled: 2025-05-07T20:33:00.6758944Z op = torch.compile(op) 2025-05-07T20:33:00.6759053Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6759123Z 2025-05-07T20:33:00.6759213Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6759217Z 2025-05-07T20:33:00.6759319Z moe/activation_test.py:117: 2025-05-07T20:33:00.6759444Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6759552Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6759650Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6760160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:00.6760261Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6760623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6760848Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6761195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6761287Z kernel = self.compile( 2025-05-07T20:33:00.6761748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6761927Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6762053Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6762060Z 2025-05-07T20:33:00.6762269Z self = 2025-05-07T20:33:00.6763060Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6763631Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a5000>} 2025-05-07T20:33:00.6764397Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6764591Z context = 2025-05-07T20:33:00.6764603Z 2025-05-07T20:33:00.6764809Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6765079Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6765187Z module_map=module_map) 2025-05-07T20:33:00.6765347Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6765449Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6765528Z E ^ 2025-05-07T20:33:00.6765885Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6765890Z 2025-05-07T20:33:00.6766318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6766322Z 2025-05-07T20:33:00.6766427Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6766695Z self=, 2025-05-07T20:33:00.6766774Z T=4096, 2025-05-07T20:33:00.6766851Z D=5120, 2025-05-07T20:33:00.6766933Z scale_ub=1200.0, 2025-05-07T20:33:00.6767024Z contiguous=False, 2025-05-07T20:33:00.6767108Z compiled=True, 2025-05-07T20:33:00.6767180Z ) 2025-05-07T20:33:00.6767403Z self = 2025-05-07T20:33:00.6767580Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:00.6767585Z 2025-05-07T20:33:00.6767662Z @given( 2025-05-07T20:33:00.6767777Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6767875Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6767995Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6768112Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6768224Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6768300Z ) 2025-05-07T20:33:00.6768548Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6768639Z def test_silu_mul_quant( 2025-05-07T20:33:00.6768712Z self, 2025-05-07T20:33:00.6768784Z T: int, 2025-05-07T20:33:00.6768858Z D: int, 2025-05-07T20:33:00.6768954Z scale_ub: Optional[float], 2025-05-07T20:33:00.6769039Z contiguous: bool, 2025-05-07T20:33:00.6769124Z compiled: bool, 2025-05-07T20:33:00.6769199Z ) -> None: 2025-05-07T20:33:00.6769294Z torch.manual_seed(2025) 2025-05-07T20:33:00.6769367Z 2025-05-07T20:33:00.6769534Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6769603Z 2025-05-07T20:33:00.6769694Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6769865Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6769952Z x = x_sign * x_clamp 2025-05-07T20:33:00.6770032Z x0 = x[:, :D] 2025-05-07T20:33:00.6770107Z x1 = x[:, D:] 2025-05-07T20:33:00.6770178Z 2025-05-07T20:33:00.6770259Z if contiguous: 2025-05-07T20:33:00.6770347Z x0 = x0.contiguous() 2025-05-07T20:33:00.6770438Z x1 = x1.contiguous() 2025-05-07T20:33:00.6770505Z 2025-05-07T20:33:00.6770592Z if scale_ub is not None: 2025-05-07T20:33:00.6770695Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6770868Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6770941Z ) 2025-05-07T20:33:00.6771015Z else: 2025-05-07T20:33:00.6771104Z scale_ub_tensor = None 2025-05-07T20:33:00.6771172Z 2025-05-07T20:33:00.6771303Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6771392Z op = silu_mul_quant 2025-05-07T20:33:00.6771472Z if compiled: 2025-05-07T20:33:00.6771572Z op = torch.compile(op) 2025-05-07T20:33:00.6771677Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6771748Z 2025-05-07T20:33:00.6771875Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6771879Z 2025-05-07T20:33:00.6771976Z moe/activation_test.py:117: 2025-05-07T20:33:00.6772101Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6772203Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6772303Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6772679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.6772768Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.6773274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6773371Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6773737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6774027Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6774375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6774470Z kernel = self.compile( 2025-05-07T20:33:00.6774885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6775085Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6775210Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6775215Z 2025-05-07T20:33:00.6775420Z self = 2025-05-07T20:33:00.6776218Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6776732Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a4700>} 2025-05-07T20:33:00.6777494Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6777689Z context = 2025-05-07T20:33:00.6777693Z 2025-05-07T20:33:00.6777860Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6778232Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6778340Z module_map=module_map) 2025-05-07T20:33:00.6778501Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6778607Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6778683Z E ^ 2025-05-07T20:33:00.6779043Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6779048Z 2025-05-07T20:33:00.6779468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6779524Z 2025-05-07T20:33:00.6779630Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6779860Z self=, 2025-05-07T20:33:00.6779935Z T=2048, 2025-05-07T20:33:00.6780011Z D=7168, 2025-05-07T20:33:00.6780091Z scale_ub=1200.0, 2025-05-07T20:33:00.6780175Z contiguous=False, 2025-05-07T20:33:00.6780261Z compiled=False, 2025-05-07T20:33:00.6780332Z ) 2025-05-07T20:33:00.6780551Z self = 2025-05-07T20:33:00.6780730Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:00.6780735Z 2025-05-07T20:33:00.6780853Z @given( 2025-05-07T20:33:00.6780971Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6781072Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6781185Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6781303Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6781417Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6781490Z ) 2025-05-07T20:33:00.6781744Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6781838Z def test_silu_mul_quant( 2025-05-07T20:33:00.6781913Z self, 2025-05-07T20:33:00.6781993Z T: int, 2025-05-07T20:33:00.6782071Z D: int, 2025-05-07T20:33:00.6782169Z scale_ub: Optional[float], 2025-05-07T20:33:00.6782259Z contiguous: bool, 2025-05-07T20:33:00.6782388Z compiled: bool, 2025-05-07T20:33:00.6782465Z ) -> None: 2025-05-07T20:33:00.6782566Z torch.manual_seed(2025) 2025-05-07T20:33:00.6782640Z 2025-05-07T20:33:00.6782813Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6782886Z 2025-05-07T20:33:00.6782977Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6783107Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6783195Z x = x_sign * x_clamp 2025-05-07T20:33:00.6783273Z x0 = x[:, :D] 2025-05-07T20:33:00.6783352Z x1 = x[:, D:] 2025-05-07T20:33:00.6783423Z 2025-05-07T20:33:00.6783505Z if contiguous: 2025-05-07T20:33:00.6783597Z x0 = x0.contiguous() 2025-05-07T20:33:00.6783683Z x1 = x1.contiguous() 2025-05-07T20:33:00.6783755Z 2025-05-07T20:33:00.6783847Z if scale_ub is not None: 2025-05-07T20:33:00.6783950Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6784089Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6784168Z ) 2025-05-07T20:33:00.6784244Z else: 2025-05-07T20:33:00.6784338Z scale_ub_tensor = None 2025-05-07T20:33:00.6784409Z 2025-05-07T20:33:00.6784539Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6784629Z op = silu_mul_quant 2025-05-07T20:33:00.6784710Z if compiled: 2025-05-07T20:33:00.6784809Z op = torch.compile(op) 2025-05-07T20:33:00.6784929Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6785010Z 2025-05-07T20:33:00.6785109Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6785114Z 2025-05-07T20:33:00.6785225Z moe/activation_test.py:117: 2025-05-07T20:33:00.6785395Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6785499Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6785597Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6786110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:00.6786210Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6786573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6786795Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6787187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6787279Z kernel = self.compile( 2025-05-07T20:33:00.6787671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6787849Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6787972Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6787979Z 2025-05-07T20:33:00.6788190Z self = 2025-05-07T20:33:00.6789020Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6789539Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a5240>} 2025-05-07T20:33:00.6790307Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6790498Z context = 2025-05-07T20:33:00.6790505Z 2025-05-07T20:33:00.6790670Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6790980Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6791089Z module_map=module_map) 2025-05-07T20:33:00.6791250Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6791346Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6791423Z E ^ 2025-05-07T20:33:00.6791784Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6791789Z 2025-05-07T20:33:00.6792212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6792216Z 2025-05-07T20:33:00.6792320Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6792543Z self=, 2025-05-07T20:33:00.6792624Z T=1, 2025-05-07T20:33:00.6792701Z D=7168, 2025-05-07T20:33:00.6792783Z scale_ub=None, 2025-05-07T20:33:00.6792870Z contiguous=True, 2025-05-07T20:33:00.6792952Z compiled=False, 2025-05-07T20:33:00.6793023Z ) 2025-05-07T20:33:00.6793245Z self = 2025-05-07T20:33:00.6793408Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.6793416Z 2025-05-07T20:33:00.6793493Z @given( 2025-05-07T20:33:00.6793609Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6793706Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6793825Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6793942Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6794094Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6794170Z ) 2025-05-07T20:33:00.6794419Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6794520Z def test_silu_mul_quant( 2025-05-07T20:33:00.6794594Z self, 2025-05-07T20:33:00.6794670Z T: int, 2025-05-07T20:33:00.6794749Z D: int, 2025-05-07T20:33:00.6794845Z scale_ub: Optional[float], 2025-05-07T20:33:00.6794932Z contiguous: bool, 2025-05-07T20:33:00.6795017Z compiled: bool, 2025-05-07T20:33:00.6795093Z ) -> None: 2025-05-07T20:33:00.6795227Z torch.manual_seed(2025) 2025-05-07T20:33:00.6795299Z 2025-05-07T20:33:00.6795468Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6795542Z 2025-05-07T20:33:00.6795634Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6795758Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6795847Z x = x_sign * x_clamp 2025-05-07T20:33:00.6795928Z x0 = x[:, :D] 2025-05-07T20:33:00.6796004Z x1 = x[:, D:] 2025-05-07T20:33:00.6796076Z 2025-05-07T20:33:00.6796158Z if contiguous: 2025-05-07T20:33:00.6796246Z x0 = x0.contiguous() 2025-05-07T20:33:00.6796374Z x1 = x1.contiguous() 2025-05-07T20:33:00.6796443Z 2025-05-07T20:33:00.6796531Z if scale_ub is not None: 2025-05-07T20:33:00.6796636Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6796771Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6796842Z ) 2025-05-07T20:33:00.6796919Z else: 2025-05-07T20:33:00.6797011Z scale_ub_tensor = None 2025-05-07T20:33:00.6797081Z 2025-05-07T20:33:00.6797211Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6797298Z op = silu_mul_quant 2025-05-07T20:33:00.6797381Z if compiled: 2025-05-07T20:33:00.6797480Z op = torch.compile(op) 2025-05-07T20:33:00.6797584Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6797655Z 2025-05-07T20:33:00.6797743Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6797791Z 2025-05-07T20:33:00.6797887Z moe/activation_test.py:117: 2025-05-07T20:33:00.6798022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6798125Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6798224Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6798734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6798833Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6799198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6799422Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6799770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6799869Z kernel = self.compile( 2025-05-07T20:33:00.6800260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6800437Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6800565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6800569Z 2025-05-07T20:33:00.6800774Z self = 2025-05-07T20:33:00.6801572Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6802127Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a6050>} 2025-05-07T20:33:00.6802897Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6803092Z context = 2025-05-07T20:33:00.6803097Z 2025-05-07T20:33:00.6803263Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6803534Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6803679Z module_map=module_map) 2025-05-07T20:33:00.6803846Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6803944Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6804018Z E ^ 2025-05-07T20:33:00.6804383Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6804388Z 2025-05-07T20:33:00.6804808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6804815Z 2025-05-07T20:33:00.6804985Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6805211Z self=, 2025-05-07T20:33:00.6805289Z T=16384, 2025-05-07T20:33:00.6805365Z D=7168, 2025-05-07T20:33:00.6805446Z scale_ub=1200.0, 2025-05-07T20:33:00.6805535Z contiguous=False, 2025-05-07T20:33:00.6805622Z compiled=True, 2025-05-07T20:33:00.6805692Z ) 2025-05-07T20:33:00.6805911Z self = 2025-05-07T20:33:00.6806094Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:00.6806098Z 2025-05-07T20:33:00.6806172Z @given( 2025-05-07T20:33:00.6806292Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6806397Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6806555Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6806677Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6806791Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6806864Z ) 2025-05-07T20:33:00.6807118Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6807208Z def test_silu_mul_quant( 2025-05-07T20:33:00.6807284Z self, 2025-05-07T20:33:00.6807358Z T: int, 2025-05-07T20:33:00.6807428Z D: int, 2025-05-07T20:33:00.6807525Z scale_ub: Optional[float], 2025-05-07T20:33:00.6807616Z contiguous: bool, 2025-05-07T20:33:00.6807698Z compiled: bool, 2025-05-07T20:33:00.6807774Z ) -> None: 2025-05-07T20:33:00.6807867Z torch.manual_seed(2025) 2025-05-07T20:33:00.6807937Z 2025-05-07T20:33:00.6808106Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6808175Z 2025-05-07T20:33:00.6808266Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6808392Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6808479Z x = x_sign * x_clamp 2025-05-07T20:33:00.6808555Z x0 = x[:, :D] 2025-05-07T20:33:00.6808633Z x1 = x[:, D:] 2025-05-07T20:33:00.6808703Z 2025-05-07T20:33:00.6808783Z if contiguous: 2025-05-07T20:33:00.6808874Z x0 = x0.contiguous() 2025-05-07T20:33:00.6808962Z x1 = x1.contiguous() 2025-05-07T20:33:00.6809030Z 2025-05-07T20:33:00.6809120Z if scale_ub is not None: 2025-05-07T20:33:00.6809221Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6809357Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6809430Z ) 2025-05-07T20:33:00.6809502Z else: 2025-05-07T20:33:00.6809640Z scale_ub_tensor = None 2025-05-07T20:33:00.6809710Z 2025-05-07T20:33:00.6809838Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6809928Z op = silu_mul_quant 2025-05-07T20:33:00.6810009Z if compiled: 2025-05-07T20:33:00.6810107Z op = torch.compile(op) 2025-05-07T20:33:00.6810212Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6810280Z 2025-05-07T20:33:00.6810368Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6810375Z 2025-05-07T20:33:00.6810472Z moe/activation_test.py:117: 2025-05-07T20:33:00.6810636Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6810736Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6810833Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6811208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.6811302Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.6811806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6811903Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6812308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6812532Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6812882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6812976Z kernel = self.compile( 2025-05-07T20:33:00.6813364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6813540Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6813666Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6813670Z 2025-05-07T20:33:00.6813875Z self = 2025-05-07T20:33:00.6814709Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6815273Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a7490>} 2025-05-07T20:33:00.6816041Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6816232Z context = 2025-05-07T20:33:00.6816237Z 2025-05-07T20:33:00.6816407Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6816674Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6816781Z module_map=module_map) 2025-05-07T20:33:00.6816944Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6817040Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6817115Z E ^ 2025-05-07T20:33:00.6817472Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6817479Z 2025-05-07T20:33:00.6817900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6817905Z 2025-05-07T20:33:00.6818007Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6818304Z self=, 2025-05-07T20:33:00.6818425Z T=1, 2025-05-07T20:33:00.6818499Z D=7168, 2025-05-07T20:33:00.6818576Z scale_ub=None, 2025-05-07T20:33:00.6818663Z contiguous=False, 2025-05-07T20:33:00.6818746Z compiled=False, 2025-05-07T20:33:00.6818815Z ) 2025-05-07T20:33:00.6819040Z self = 2025-05-07T20:33:00.6819206Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:00.6819211Z 2025-05-07T20:33:00.6819283Z @given( 2025-05-07T20:33:00.6819402Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6819540Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6819656Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6819773Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6819883Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6819959Z ) 2025-05-07T20:33:00.6820211Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6820301Z def test_silu_mul_quant( 2025-05-07T20:33:00.6820379Z self, 2025-05-07T20:33:00.6820458Z T: int, 2025-05-07T20:33:00.6820529Z D: int, 2025-05-07T20:33:00.6820666Z scale_ub: Optional[float], 2025-05-07T20:33:00.6820752Z contiguous: bool, 2025-05-07T20:33:00.6820834Z compiled: bool, 2025-05-07T20:33:00.6820914Z ) -> None: 2025-05-07T20:33:00.6821007Z torch.manual_seed(2025) 2025-05-07T20:33:00.6821080Z 2025-05-07T20:33:00.6821250Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6821324Z 2025-05-07T20:33:00.6821414Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6821537Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6821623Z x = x_sign * x_clamp 2025-05-07T20:33:00.6821703Z x0 = x[:, :D] 2025-05-07T20:33:00.6821778Z x1 = x[:, D:] 2025-05-07T20:33:00.6821851Z 2025-05-07T20:33:00.6821935Z if contiguous: 2025-05-07T20:33:00.6822024Z x0 = x0.contiguous() 2025-05-07T20:33:00.6822110Z x1 = x1.contiguous() 2025-05-07T20:33:00.6822226Z 2025-05-07T20:33:00.6822313Z if scale_ub is not None: 2025-05-07T20:33:00.6822417Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6822552Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6822623Z ) 2025-05-07T20:33:00.6822698Z else: 2025-05-07T20:33:00.6822790Z scale_ub_tensor = None 2025-05-07T20:33:00.6822861Z 2025-05-07T20:33:00.6822995Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6823080Z op = silu_mul_quant 2025-05-07T20:33:00.6823160Z if compiled: 2025-05-07T20:33:00.6823260Z op = torch.compile(op) 2025-05-07T20:33:00.6823361Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6823428Z 2025-05-07T20:33:00.6823520Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6823525Z 2025-05-07T20:33:00.6823618Z moe/activation_test.py:117: 2025-05-07T20:33:00.6823744Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6823845Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6823944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6824454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6824547Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6824911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6825150Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6825532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6825672Z kernel = self.compile( 2025-05-07T20:33:00.6826060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6826236Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6826362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6826367Z 2025-05-07T20:33:00.6826570Z self = 2025-05-07T20:33:00.6827361Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6827910Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a77f0>} 2025-05-07T20:33:00.6828674Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6828908Z context = 2025-05-07T20:33:00.6828913Z 2025-05-07T20:33:00.6829079Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6829351Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6829455Z module_map=module_map) 2025-05-07T20:33:00.6829622Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6829721Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6829798Z E ^ 2025-05-07T20:33:00.6830155Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6830162Z 2025-05-07T20:33:00.6830587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6830633Z 2025-05-07T20:33:00.6830736Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6830964Z self=, 2025-05-07T20:33:00.6831039Z T=2048, 2025-05-07T20:33:00.6831112Z D=7168, 2025-05-07T20:33:00.6831195Z scale_ub=None, 2025-05-07T20:33:00.6831279Z contiguous=False, 2025-05-07T20:33:00.6831359Z compiled=True, 2025-05-07T20:33:00.6831434Z ) 2025-05-07T20:33:00.6831655Z self = 2025-05-07T20:33:00.6831833Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:00.6831837Z 2025-05-07T20:33:00.6831910Z @given( 2025-05-07T20:33:00.6832029Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6832134Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6832248Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6832366Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6832487Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6832561Z ) 2025-05-07T20:33:00.6832811Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6832914Z def test_silu_mul_quant( 2025-05-07T20:33:00.6832990Z self, 2025-05-07T20:33:00.6833069Z T: int, 2025-05-07T20:33:00.6833143Z D: int, 2025-05-07T20:33:00.6833238Z scale_ub: Optional[float], 2025-05-07T20:33:00.6833330Z contiguous: bool, 2025-05-07T20:33:00.6833417Z compiled: bool, 2025-05-07T20:33:00.6833495Z ) -> None: 2025-05-07T20:33:00.6833591Z torch.manual_seed(2025) 2025-05-07T20:33:00.6833662Z 2025-05-07T20:33:00.6833829Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6833951Z 2025-05-07T20:33:00.6834042Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6838007Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6838116Z x = x_sign * x_clamp 2025-05-07T20:33:00.6838195Z x0 = x[:, :D] 2025-05-07T20:33:00.6838277Z x1 = x[:, D:] 2025-05-07T20:33:00.6838346Z 2025-05-07T20:33:00.6838426Z if contiguous: 2025-05-07T20:33:00.6838516Z x0 = x0.contiguous() 2025-05-07T20:33:00.6838601Z x1 = x1.contiguous() 2025-05-07T20:33:00.6838672Z 2025-05-07T20:33:00.6838761Z if scale_ub is not None: 2025-05-07T20:33:00.6838955Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6839095Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6839168Z ) 2025-05-07T20:33:00.6839239Z else: 2025-05-07T20:33:00.6839334Z scale_ub_tensor = None 2025-05-07T20:33:00.6839411Z 2025-05-07T20:33:00.6839545Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6839635Z op = silu_mul_quant 2025-05-07T20:33:00.6839716Z if compiled: 2025-05-07T20:33:00.6839816Z op = torch.compile(op) 2025-05-07T20:33:00.6839924Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6840037Z 2025-05-07T20:33:00.6840130Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6840135Z 2025-05-07T20:33:00.6840231Z moe/activation_test.py:117: 2025-05-07T20:33:00.6840359Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6840460Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6840560Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6840940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.6841039Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.6841548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6841648Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6842012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6842281Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6842631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6842722Z kernel = self.compile( 2025-05-07T20:33:00.6843111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6843288Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6843413Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6843418Z 2025-05-07T20:33:00.6843629Z self = 2025-05-07T20:33:00.6844427Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6844969Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd505540af0>} 2025-05-07T20:33:00.6845762Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6845956Z context = 2025-05-07T20:33:00.6845960Z 2025-05-07T20:33:00.6846128Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6846440Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6846550Z module_map=module_map) 2025-05-07T20:33:00.6846714Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6846815Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6846894Z E ^ 2025-05-07T20:33:00.6847257Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6847262Z 2025-05-07T20:33:00.6847683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6847729Z 2025-05-07T20:33:00.6847836Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6848061Z self=, 2025-05-07T20:33:00.6848139Z T=4096, 2025-05-07T20:33:00.6848212Z D=7168, 2025-05-07T20:33:00.6848293Z scale_ub=None, 2025-05-07T20:33:00.6848383Z contiguous=False, 2025-05-07T20:33:00.6848466Z compiled=True, 2025-05-07T20:33:00.6848537Z ) 2025-05-07T20:33:00.6848761Z self = 2025-05-07T20:33:00.6848980Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:00.6848985Z 2025-05-07T20:33:00.6849060Z @given( 2025-05-07T20:33:00.6849182Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6849279Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6849398Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6849518Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6849631Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6849708Z ) 2025-05-07T20:33:00.6849956Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6850048Z def test_silu_mul_quant( 2025-05-07T20:33:00.6850130Z self, 2025-05-07T20:33:00.6850204Z T: int, 2025-05-07T20:33:00.6850278Z D: int, 2025-05-07T20:33:00.6850377Z scale_ub: Optional[float], 2025-05-07T20:33:00.6850510Z contiguous: bool, 2025-05-07T20:33:00.6850593Z compiled: bool, 2025-05-07T20:33:00.6850674Z ) -> None: 2025-05-07T20:33:00.6850767Z torch.manual_seed(2025) 2025-05-07T20:33:00.6850840Z 2025-05-07T20:33:00.6851009Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6851080Z 2025-05-07T20:33:00.6851175Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6851302Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6851389Z x = x_sign * x_clamp 2025-05-07T20:33:00.6851474Z x0 = x[:, :D] 2025-05-07T20:33:00.6851551Z x1 = x[:, D:] 2025-05-07T20:33:00.6851619Z 2025-05-07T20:33:00.6851703Z if contiguous: 2025-05-07T20:33:00.6851792Z x0 = x0.contiguous() 2025-05-07T20:33:00.6851878Z x1 = x1.contiguous() 2025-05-07T20:33:00.6851948Z 2025-05-07T20:33:00.6852036Z if scale_ub is not None: 2025-05-07T20:33:00.6852144Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6852279Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6852351Z ) 2025-05-07T20:33:00.6852424Z else: 2025-05-07T20:33:00.6852514Z scale_ub_tensor = None 2025-05-07T20:33:00.6852585Z 2025-05-07T20:33:00.6852718Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6852803Z op = silu_mul_quant 2025-05-07T20:33:00.6852888Z if compiled: 2025-05-07T20:33:00.6852989Z op = torch.compile(op) 2025-05-07T20:33:00.6853092Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6853161Z 2025-05-07T20:33:00.6853251Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6853255Z 2025-05-07T20:33:00.6853350Z moe/activation_test.py:117: 2025-05-07T20:33:00.6853533Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6853633Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6853738Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6854118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.6854210Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.6854714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6854855Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6855269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6855494Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6856133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6856232Z kernel = self.compile( 2025-05-07T20:33:00.6856620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6856887Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6857014Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6857019Z 2025-05-07T20:33:00.6857224Z self = 2025-05-07T20:33:00.6858021Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6858634Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd505540280>} 2025-05-07T20:33:00.6859396Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6859658Z context = 2025-05-07T20:33:00.6859663Z 2025-05-07T20:33:00.6859831Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6860098Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6860206Z module_map=module_map) 2025-05-07T20:33:00.6860371Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6860469Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6860541Z E ^ 2025-05-07T20:33:00.6860903Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6860908Z 2025-05-07T20:33:00.6861331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6861339Z 2025-05-07T20:33:00.6861443Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6861670Z self=, 2025-05-07T20:33:00.6861746Z T=16384, 2025-05-07T20:33:00.6861820Z D=5120, 2025-05-07T20:33:00.6861899Z scale_ub=1200.0, 2025-05-07T20:33:00.6861982Z contiguous=False, 2025-05-07T20:33:00.6862066Z compiled=False, 2025-05-07T20:33:00.6862134Z ) 2025-05-07T20:33:00.6862357Z self = 2025-05-07T20:33:00.6862537Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:00.6862542Z 2025-05-07T20:33:00.6862614Z @given( 2025-05-07T20:33:00.6862797Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6862895Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6863008Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6863135Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6863250Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6863320Z ) 2025-05-07T20:33:00.6863570Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6863659Z def test_silu_mul_quant( 2025-05-07T20:33:00.6863735Z self, 2025-05-07T20:33:00.6863807Z T: int, 2025-05-07T20:33:00.6863939Z D: int, 2025-05-07T20:33:00.6864036Z scale_ub: Optional[float], 2025-05-07T20:33:00.6864122Z contiguous: bool, 2025-05-07T20:33:00.6864204Z compiled: bool, 2025-05-07T20:33:00.6864280Z ) -> None: 2025-05-07T20:33:00.6864373Z torch.manual_seed(2025) 2025-05-07T20:33:00.6864440Z 2025-05-07T20:33:00.6864616Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6864687Z 2025-05-07T20:33:00.6864775Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6864905Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6864989Z x = x_sign * x_clamp 2025-05-07T20:33:00.6865110Z x0 = x[:, :D] 2025-05-07T20:33:00.6865188Z x1 = x[:, D:] 2025-05-07T20:33:00.6865256Z 2025-05-07T20:33:00.6865337Z if contiguous: 2025-05-07T20:33:00.6865427Z x0 = x0.contiguous() 2025-05-07T20:33:00.6865512Z x1 = x1.contiguous() 2025-05-07T20:33:00.6865587Z 2025-05-07T20:33:00.6865674Z if scale_ub is not None: 2025-05-07T20:33:00.6865777Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6865913Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6865983Z ) 2025-05-07T20:33:00.6866054Z else: 2025-05-07T20:33:00.6866151Z scale_ub_tensor = None 2025-05-07T20:33:00.6866222Z 2025-05-07T20:33:00.6866350Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6866440Z op = silu_mul_quant 2025-05-07T20:33:00.6866564Z if compiled: 2025-05-07T20:33:00.6866662Z op = torch.compile(op) 2025-05-07T20:33:00.6866768Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6866836Z 2025-05-07T20:33:00.6866928Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6866933Z 2025-05-07T20:33:00.6867027Z moe/activation_test.py:117: 2025-05-07T20:33:00.6867151Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6867254Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6867351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6867864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:00.6867960Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6868325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6868554Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6868902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6868993Z kernel = self.compile( 2025-05-07T20:33:00.6869383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6869559Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6869691Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6869696Z 2025-05-07T20:33:00.6869901Z self = 2025-05-07T20:33:00.6870765Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6871285Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd505542d40>} 2025-05-07T20:33:00.6872043Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6872276Z context = 2025-05-07T20:33:00.6872281Z 2025-05-07T20:33:00.6872446Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6872711Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6872822Z module_map=module_map) 2025-05-07T20:33:00.6872980Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6873082Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6873158Z E ^ 2025-05-07T20:33:00.6873557Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6873562Z 2025-05-07T20:33:00.6873986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6873991Z 2025-05-07T20:33:00.6874093Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6874324Z self=, 2025-05-07T20:33:00.6874397Z T=16384, 2025-05-07T20:33:00.6874469Z D=5120, 2025-05-07T20:33:00.6874551Z scale_ub=1200.0, 2025-05-07T20:33:00.6874632Z contiguous=True, 2025-05-07T20:33:00.6874712Z compiled=True, 2025-05-07T20:33:00.6874785Z ) 2025-05-07T20:33:00.6875035Z self = 2025-05-07T20:33:00.6875229Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:00.6875275Z 2025-05-07T20:33:00.6875353Z @given( 2025-05-07T20:33:00.6875471Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6875571Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6875683Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6875801Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6875919Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6875990Z ) 2025-05-07T20:33:00.6876237Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6876333Z def test_silu_mul_quant( 2025-05-07T20:33:00.6876404Z self, 2025-05-07T20:33:00.6876476Z T: int, 2025-05-07T20:33:00.6876550Z D: int, 2025-05-07T20:33:00.6876646Z scale_ub: Optional[float], 2025-05-07T20:33:00.6876734Z contiguous: bool, 2025-05-07T20:33:00.6876818Z compiled: bool, 2025-05-07T20:33:00.6876894Z ) -> None: 2025-05-07T20:33:00.6876989Z torch.manual_seed(2025) 2025-05-07T20:33:00.6877058Z 2025-05-07T20:33:00.6877228Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6877300Z 2025-05-07T20:33:00.6877390Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6877512Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6877600Z x = x_sign * x_clamp 2025-05-07T20:33:00.6877678Z x0 = x[:, :D] 2025-05-07T20:33:00.6877754Z x1 = x[:, D:] 2025-05-07T20:33:00.6877826Z 2025-05-07T20:33:00.6877907Z if contiguous: 2025-05-07T20:33:00.6877996Z x0 = x0.contiguous() 2025-05-07T20:33:00.6878083Z x1 = x1.contiguous() 2025-05-07T20:33:00.6878150Z 2025-05-07T20:33:00.6878285Z if scale_ub is not None: 2025-05-07T20:33:00.6878387Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6878519Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6878598Z ) 2025-05-07T20:33:00.6878670Z else: 2025-05-07T20:33:00.6878762Z scale_ub_tensor = None 2025-05-07T20:33:00.6878833Z 2025-05-07T20:33:00.6878961Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6879047Z op = silu_mul_quant 2025-05-07T20:33:00.6879130Z if compiled: 2025-05-07T20:33:00.6879226Z op = torch.compile(op) 2025-05-07T20:33:00.6879371Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6879442Z 2025-05-07T20:33:00.6879528Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6879532Z 2025-05-07T20:33:00.6879631Z moe/activation_test.py:117: 2025-05-07T20:33:00.6879754Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6879856Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6879956Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6880330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.6880462Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.6880970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6881067Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6881431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6881656Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6882001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6882094Z kernel = self.compile( 2025-05-07T20:33:00.6882483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6882657Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6882829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6882834Z 2025-05-07T20:33:00.6883038Z self = 2025-05-07T20:33:00.6883832Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6884344Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd505542830>} 2025-05-07T20:33:00.6885149Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6885350Z context = 2025-05-07T20:33:00.6885355Z 2025-05-07T20:33:00.6885520Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6885788Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6885894Z module_map=module_map) 2025-05-07T20:33:00.6886056Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6886157Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6886232Z E ^ 2025-05-07T20:33:00.6886592Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6886597Z 2025-05-07T20:33:00.6887057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6887062Z 2025-05-07T20:33:00.6887166Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6887396Z self=, 2025-05-07T20:33:00.6887473Z T=16384, 2025-05-07T20:33:00.6887550Z D=5120, 2025-05-07T20:33:00.6887629Z scale_ub=None, 2025-05-07T20:33:00.6887714Z contiguous=False, 2025-05-07T20:33:00.6887798Z compiled=True, 2025-05-07T20:33:00.6887866Z ) 2025-05-07T20:33:00.6888082Z self = 2025-05-07T20:33:00.6888301Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:00.6888306Z 2025-05-07T20:33:00.6888379Z @given( 2025-05-07T20:33:00.6888494Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6888595Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6888709Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6888827Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6888938Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6889011Z ) 2025-05-07T20:33:00.6889300Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6889391Z def test_silu_mul_quant( 2025-05-07T20:33:00.6889464Z self, 2025-05-07T20:33:00.6889545Z T: int, 2025-05-07T20:33:00.6889618Z D: int, 2025-05-07T20:33:00.6889714Z scale_ub: Optional[float], 2025-05-07T20:33:00.6889807Z contiguous: bool, 2025-05-07T20:33:00.6889888Z compiled: bool, 2025-05-07T20:33:00.6889961Z ) -> None: 2025-05-07T20:33:00.6890054Z torch.manual_seed(2025) 2025-05-07T20:33:00.6890124Z 2025-05-07T20:33:00.6890296Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6890366Z 2025-05-07T20:33:00.6890458Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6890582Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6890667Z x = x_sign * x_clamp 2025-05-07T20:33:00.6890787Z x0 = x[:, :D] 2025-05-07T20:33:00.6890864Z x1 = x[:, D:] 2025-05-07T20:33:00.6890933Z 2025-05-07T20:33:00.6891015Z if contiguous: 2025-05-07T20:33:00.6891107Z x0 = x0.contiguous() 2025-05-07T20:33:00.6891191Z x1 = x1.contiguous() 2025-05-07T20:33:00.6891259Z 2025-05-07T20:33:00.6891348Z if scale_ub is not None: 2025-05-07T20:33:00.6891452Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6891591Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6891662Z ) 2025-05-07T20:33:00.6891733Z else: 2025-05-07T20:33:00.6891826Z scale_ub_tensor = None 2025-05-07T20:33:00.6891897Z 2025-05-07T20:33:00.6892026Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6892116Z op = silu_mul_quant 2025-05-07T20:33:00.6892198Z if compiled: 2025-05-07T20:33:00.6892293Z op = torch.compile(op) 2025-05-07T20:33:00.6892400Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6892468Z 2025-05-07T20:33:00.6892556Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6892561Z 2025-05-07T20:33:00.6892658Z moe/activation_test.py:117: 2025-05-07T20:33:00.6892780Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6892880Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6892977Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6893354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.6893446Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.6893994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6894089Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6894452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6894679Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6895026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6895117Z kernel = self.compile( 2025-05-07T20:33:00.6895504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6895722Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6895843Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6895847Z 2025-05-07T20:33:00.6896053Z self = 2025-05-07T20:33:00.6896846Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6897400Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd505543760>} 2025-05-07T20:33:00.6898240Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6898437Z context = 2025-05-07T20:33:00.6898445Z 2025-05-07T20:33:00.6898608Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6898880Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6898987Z module_map=module_map) 2025-05-07T20:33:00.6899146Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6899311Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6899386Z E ^ 2025-05-07T20:33:00.6899748Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6899752Z 2025-05-07T20:33:00.6900175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6900182Z 2025-05-07T20:33:00.6900284Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6900507Z self=, 2025-05-07T20:33:00.6900584Z T=2048, 2025-05-07T20:33:00.6900657Z D=5120, 2025-05-07T20:33:00.6900735Z scale_ub=None, 2025-05-07T20:33:00.6900820Z contiguous=False, 2025-05-07T20:33:00.6900901Z compiled=True, 2025-05-07T20:33:00.6900972Z ) 2025-05-07T20:33:00.6901191Z self = 2025-05-07T20:33:00.6901366Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:00.6901373Z 2025-05-07T20:33:00.6901450Z @given( 2025-05-07T20:33:00.6901568Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6901665Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6901779Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6901894Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6902008Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6902084Z ) 2025-05-07T20:33:00.6902330Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6902424Z def test_silu_mul_quant( 2025-05-07T20:33:00.6902496Z self, 2025-05-07T20:33:00.6902612Z T: int, 2025-05-07T20:33:00.6902688Z D: int, 2025-05-07T20:33:00.6902782Z scale_ub: Optional[float], 2025-05-07T20:33:00.6902866Z contiguous: bool, 2025-05-07T20:33:00.6902953Z compiled: bool, 2025-05-07T20:33:00.6903029Z ) -> None: 2025-05-07T20:33:00.6903123Z torch.manual_seed(2025) 2025-05-07T20:33:00.6903194Z 2025-05-07T20:33:00.6903360Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6903430Z 2025-05-07T20:33:00.6903521Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6903643Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6903771Z x = x_sign * x_clamp 2025-05-07T20:33:00.6903851Z x0 = x[:, :D] 2025-05-07T20:33:00.6903926Z x1 = x[:, D:] 2025-05-07T20:33:00.6903999Z 2025-05-07T20:33:00.6904079Z if contiguous: 2025-05-07T20:33:00.6904166Z x0 = x0.contiguous() 2025-05-07T20:33:00.6904252Z x1 = x1.contiguous() 2025-05-07T20:33:00.6904324Z 2025-05-07T20:33:00.6904412Z if scale_ub is not None: 2025-05-07T20:33:00.6904518Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6904651Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6904723Z ) 2025-05-07T20:33:00.6904858Z else: 2025-05-07T20:33:00.6904963Z scale_ub_tensor = None 2025-05-07T20:33:00.6905044Z 2025-05-07T20:33:00.6905189Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6905274Z op = silu_mul_quant 2025-05-07T20:33:00.6905356Z if compiled: 2025-05-07T20:33:00.6905456Z op = torch.compile(op) 2025-05-07T20:33:00.6905559Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6905629Z 2025-05-07T20:33:00.6905715Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6905720Z 2025-05-07T20:33:00.6905814Z moe/activation_test.py:117: 2025-05-07T20:33:00.6905943Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6906040Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6906138Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6906560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.6906652Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.6907156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6907252Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6907615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6907839Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6908182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6908274Z kernel = self.compile( 2025-05-07T20:33:00.6908661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6908838Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6908963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6908968Z 2025-05-07T20:33:00.6909173Z self = 2025-05-07T20:33:00.6909961Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6910474Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd5054cc3a0>} 2025-05-07T20:33:00.6911272Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6911472Z context = 2025-05-07T20:33:00.6911477Z 2025-05-07T20:33:00.6911642Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6911910Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6912019Z module_map=module_map) 2025-05-07T20:33:00.6912231Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6912331Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6912405Z E ^ 2025-05-07T20:33:00.6912762Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6912767Z 2025-05-07T20:33:00.6913195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6913202Z 2025-05-07T20:33:00.6913303Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6913566Z self=, 2025-05-07T20:33:00.6913642Z T=2048, 2025-05-07T20:33:00.6913714Z D=5120, 2025-05-07T20:33:00.6913795Z scale_ub=1200.0, 2025-05-07T20:33:00.6913878Z contiguous=False, 2025-05-07T20:33:00.6913961Z compiled=True, 2025-05-07T20:33:00.6914032Z ) 2025-05-07T20:33:00.6914255Z self = 2025-05-07T20:33:00.6914428Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:00.6914436Z 2025-05-07T20:33:00.6914507Z @given( 2025-05-07T20:33:00.6914622Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6914723Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6914834Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6914948Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6915129Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6915217Z ) 2025-05-07T20:33:00.6915477Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6915569Z def test_silu_mul_quant( 2025-05-07T20:33:00.6915642Z self, 2025-05-07T20:33:00.6915714Z T: int, 2025-05-07T20:33:00.6915788Z D: int, 2025-05-07T20:33:00.6915887Z scale_ub: Optional[float], 2025-05-07T20:33:00.6915976Z contiguous: bool, 2025-05-07T20:33:00.6916056Z compiled: bool, 2025-05-07T20:33:00.6916130Z ) -> None: 2025-05-07T20:33:00.6916228Z torch.manual_seed(2025) 2025-05-07T20:33:00.6916296Z 2025-05-07T20:33:00.6916464Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6916539Z 2025-05-07T20:33:00.6916628Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6916749Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6916843Z x = x_sign * x_clamp 2025-05-07T20:33:00.6916918Z x0 = x[:, :D] 2025-05-07T20:33:00.6916998Z x1 = x[:, D:] 2025-05-07T20:33:00.6917070Z 2025-05-07T20:33:00.6917150Z if contiguous: 2025-05-07T20:33:00.6917240Z x0 = x0.contiguous() 2025-05-07T20:33:00.6917325Z x1 = x1.contiguous() 2025-05-07T20:33:00.6917392Z 2025-05-07T20:33:00.6917481Z if scale_ub is not None: 2025-05-07T20:33:00.6917585Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6917718Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6917792Z ) 2025-05-07T20:33:00.6917863Z else: 2025-05-07T20:33:00.6917952Z scale_ub_tensor = None 2025-05-07T20:33:00.6918024Z 2025-05-07T20:33:00.6918196Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6918284Z op = silu_mul_quant 2025-05-07T20:33:00.6918367Z if compiled: 2025-05-07T20:33:00.6918465Z op = torch.compile(op) 2025-05-07T20:33:00.6918572Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6918640Z 2025-05-07T20:33:00.6918727Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6918732Z 2025-05-07T20:33:00.6918829Z moe/activation_test.py:117: 2025-05-07T20:33:00.6918952Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6919092Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6919190Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6919561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.6919652Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.6920156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6920250Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6920616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6920875Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6921220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6921313Z kernel = self.compile( 2025-05-07T20:33:00.6921698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6921880Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6922000Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6922005Z 2025-05-07T20:33:00.6922213Z self = 2025-05-07T20:33:00.6923006Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6923555Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd5054cc820>} 2025-05-07T20:33:00.6924315Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6924509Z context = 2025-05-07T20:33:00.6924513Z 2025-05-07T20:33:00.6924680Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6924948Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6925056Z module_map=module_map) 2025-05-07T20:33:00.6925219Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6925318Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6925392Z E ^ 2025-05-07T20:33:00.6925752Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6925757Z 2025-05-07T20:33:00.6926178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6926185Z 2025-05-07T20:33:00.6926289Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6926512Z self=, 2025-05-07T20:33:00.6926586Z T=4096, 2025-05-07T20:33:00.6926661Z D=5120, 2025-05-07T20:33:00.6926783Z scale_ub=1200.0, 2025-05-07T20:33:00.6926864Z contiguous=True, 2025-05-07T20:33:00.6926947Z compiled=True, 2025-05-07T20:33:00.6927016Z ) 2025-05-07T20:33:00.6927235Z self = 2025-05-07T20:33:00.6927411Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:00.6927416Z 2025-05-07T20:33:00.6927488Z @given( 2025-05-07T20:33:00.6927605Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6927702Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6927814Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6927972Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6928082Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6928152Z ) 2025-05-07T20:33:00.6928400Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6928491Z def test_silu_mul_quant( 2025-05-07T20:33:00.6928566Z self, 2025-05-07T20:33:00.6928643Z T: int, 2025-05-07T20:33:00.6928714Z D: int, 2025-05-07T20:33:00.6928810Z scale_ub: Optional[float], 2025-05-07T20:33:00.6928901Z contiguous: bool, 2025-05-07T20:33:00.6928984Z compiled: bool, 2025-05-07T20:33:00.6929124Z ) -> None: 2025-05-07T20:33:00.6929219Z torch.manual_seed(2025) 2025-05-07T20:33:00.6929289Z 2025-05-07T20:33:00.6929460Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6929530Z 2025-05-07T20:33:00.6929619Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6929746Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6929830Z x = x_sign * x_clamp 2025-05-07T20:33:00.6929906Z x0 = x[:, :D] 2025-05-07T20:33:00.6929990Z x1 = x[:, D:] 2025-05-07T20:33:00.6930059Z 2025-05-07T20:33:00.6930138Z if contiguous: 2025-05-07T20:33:00.6930229Z x0 = x0.contiguous() 2025-05-07T20:33:00.6930315Z x1 = x1.contiguous() 2025-05-07T20:33:00.6930386Z 2025-05-07T20:33:00.6930473Z if scale_ub is not None: 2025-05-07T20:33:00.6930618Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6930755Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6930826Z ) 2025-05-07T20:33:00.6930897Z else: 2025-05-07T20:33:00.6930990Z scale_ub_tensor = None 2025-05-07T20:33:00.6931060Z 2025-05-07T20:33:00.6931187Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6931282Z op = silu_mul_quant 2025-05-07T20:33:00.6931362Z if compiled: 2025-05-07T20:33:00.6931457Z op = torch.compile(op) 2025-05-07T20:33:00.6931562Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6931630Z 2025-05-07T20:33:00.6931721Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6931725Z 2025-05-07T20:33:00.6931827Z moe/activation_test.py:117: 2025-05-07T20:33:00.6931951Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6932054Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6932154Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6932531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.6932623Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.6933123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6933223Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6933585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6933806Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6934197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6934290Z kernel = self.compile( 2025-05-07T20:33:00.6934676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6934856Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6934978Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6934983Z 2025-05-07T20:33:00.6935221Z self = 2025-05-07T20:33:00.6936027Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6936579Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd5054cd360>} 2025-05-07T20:33:00.6937341Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6937572Z context = 2025-05-07T20:33:00.6937577Z 2025-05-07T20:33:00.6937745Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6938013Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6938207Z module_map=module_map) 2025-05-07T20:33:00.6938366Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6938462Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6938539Z E ^ 2025-05-07T20:33:00.6938899Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6938904Z 2025-05-07T20:33:00.6939322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6939370Z 2025-05-07T20:33:00.6939479Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6939702Z self=, 2025-05-07T20:33:00.6939780Z T=128, 2025-05-07T20:33:00.6939854Z D=5120, 2025-05-07T20:33:00.6939936Z scale_ub=1200.0, 2025-05-07T20:33:00.6940022Z contiguous=False, 2025-05-07T20:33:00.6940106Z compiled=True, 2025-05-07T20:33:00.6940174Z ) 2025-05-07T20:33:00.6940394Z self = 2025-05-07T20:33:00.6940563Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:00.6940567Z 2025-05-07T20:33:00.6940641Z @given( 2025-05-07T20:33:00.6940760Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6940855Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6940970Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6941087Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6941199Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6941273Z ) 2025-05-07T20:33:00.6941518Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6941607Z def test_silu_mul_quant( 2025-05-07T20:33:00.6941684Z self, 2025-05-07T20:33:00.6941756Z T: int, 2025-05-07T20:33:00.6941831Z D: int, 2025-05-07T20:33:00.6941930Z scale_ub: Optional[float], 2025-05-07T20:33:00.6942016Z contiguous: bool, 2025-05-07T20:33:00.6942098Z compiled: bool, 2025-05-07T20:33:00.6942175Z ) -> None: 2025-05-07T20:33:00.6942264Z torch.manual_seed(2025) 2025-05-07T20:33:00.6942336Z 2025-05-07T20:33:00.6942549Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6942620Z 2025-05-07T20:33:00.6942713Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6942842Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6942925Z x = x_sign * x_clamp 2025-05-07T20:33:00.6943005Z x0 = x[:, :D] 2025-05-07T20:33:00.6943083Z x1 = x[:, D:] 2025-05-07T20:33:00.6943151Z 2025-05-07T20:33:00.6943235Z if contiguous: 2025-05-07T20:33:00.6943325Z x0 = x0.contiguous() 2025-05-07T20:33:00.6943410Z x1 = x1.contiguous() 2025-05-07T20:33:00.6943521Z 2025-05-07T20:33:00.6943608Z if scale_ub is not None: 2025-05-07T20:33:00.6943715Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6943849Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6943921Z ) 2025-05-07T20:33:00.6943997Z else: 2025-05-07T20:33:00.6944088Z scale_ub_tensor = None 2025-05-07T20:33:00.6944161Z 2025-05-07T20:33:00.6944293Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6944383Z op = silu_mul_quant 2025-05-07T20:33:00.6944466Z if compiled: 2025-05-07T20:33:00.6944565Z op = torch.compile(op) 2025-05-07T20:33:00.6944707Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6944778Z 2025-05-07T20:33:00.6944868Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6944873Z 2025-05-07T20:33:00.6944968Z moe/activation_test.py:117: 2025-05-07T20:33:00.6945096Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6945200Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6945312Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6945724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.6945814Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.6946318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6946461Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6946827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6947050Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6947396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6947490Z kernel = self.compile( 2025-05-07T20:33:00.6947878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6948053Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6948172Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6948187Z 2025-05-07T20:33:00.6948390Z self = 2025-05-07T20:33:00.6949180Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6949693Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd5054ce290>} 2025-05-07T20:33:00.6950452Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6950645Z context = 2025-05-07T20:33:00.6950650Z 2025-05-07T20:33:00.6950853Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6951121Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6951229Z module_map=module_map) 2025-05-07T20:33:00.6951391Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6951489Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6951561Z E ^ 2025-05-07T20:33:00.6951918Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6951922Z 2025-05-07T20:33:00.6952386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6952391Z 2025-05-07T20:33:00.6952493Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6952717Z self=, 2025-05-07T20:33:00.6952796Z T=16384, 2025-05-07T20:33:00.6952870Z D=7168, 2025-05-07T20:33:00.6952952Z scale_ub=1200.0, 2025-05-07T20:33:00.6953032Z contiguous=True, 2025-05-07T20:33:00.6953110Z compiled=True, 2025-05-07T20:33:00.6953184Z ) 2025-05-07T20:33:00.6953444Z self = 2025-05-07T20:33:00.6953619Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:00.6953624Z 2025-05-07T20:33:00.6953700Z @given( 2025-05-07T20:33:00.6953815Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6953910Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6954028Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6954143Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6954255Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6954328Z ) 2025-05-07T20:33:00.6954577Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6954671Z def test_silu_mul_quant( 2025-05-07T20:33:00.6954744Z self, 2025-05-07T20:33:00.6954817Z T: int, 2025-05-07T20:33:00.6954934Z D: int, 2025-05-07T20:33:00.6955029Z scale_ub: Optional[float], 2025-05-07T20:33:00.6955119Z contiguous: bool, 2025-05-07T20:33:00.6955205Z compiled: bool, 2025-05-07T20:33:00.6955279Z ) -> None: 2025-05-07T20:33:00.6955373Z torch.manual_seed(2025) 2025-05-07T20:33:00.6955448Z 2025-05-07T20:33:00.6955869Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6955988Z 2025-05-07T20:33:00.6956115Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6956282Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6956411Z x = x_sign * x_clamp 2025-05-07T20:33:00.6956501Z x0 = x[:, :D] 2025-05-07T20:33:00.6956579Z x1 = x[:, D:] 2025-05-07T20:33:00.6956646Z 2025-05-07T20:33:00.6956728Z if contiguous: 2025-05-07T20:33:00.6956816Z x0 = x0.contiguous() 2025-05-07T20:33:00.6956906Z x1 = x1.contiguous() 2025-05-07T20:33:00.6956981Z 2025-05-07T20:33:00.6960958Z if scale_ub is not None: 2025-05-07T20:33:00.6961086Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6961222Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6961299Z ) 2025-05-07T20:33:00.6961371Z else: 2025-05-07T20:33:00.6961462Z scale_ub_tensor = None 2025-05-07T20:33:00.6961533Z 2025-05-07T20:33:00.6961663Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6961757Z op = silu_mul_quant 2025-05-07T20:33:00.6961839Z if compiled: 2025-05-07T20:33:00.6961939Z op = torch.compile(op) 2025-05-07T20:33:00.6962046Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6962116Z 2025-05-07T20:33:00.6962204Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6962319Z 2025-05-07T20:33:00.6962423Z moe/activation_test.py:117: 2025-05-07T20:33:00.6962555Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6962657Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6962762Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6963141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.6963235Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.6963738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6963931Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6964296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6964518Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6964878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6964989Z kernel = self.compile( 2025-05-07T20:33:00.6965464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6965644Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6965770Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6965774Z 2025-05-07T20:33:00.6965982Z self = 2025-05-07T20:33:00.6966784Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6967296Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd5054ced40>} 2025-05-07T20:33:00.6968061Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6968314Z context = 2025-05-07T20:33:00.6968319Z 2025-05-07T20:33:00.6968486Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6968753Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6968860Z module_map=module_map) 2025-05-07T20:33:00.6969024Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6969121Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6969196Z E ^ 2025-05-07T20:33:00.6969562Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6969566Z 2025-05-07T20:33:00.6969990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6969997Z 2025-05-07T20:33:00.6970100Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6970324Z self=, 2025-05-07T20:33:00.6970397Z T=16384, 2025-05-07T20:33:00.6970475Z D=5120, 2025-05-07T20:33:00.6970554Z scale_ub=1200.0, 2025-05-07T20:33:00.6970639Z contiguous=True, 2025-05-07T20:33:00.6970723Z compiled=False, 2025-05-07T20:33:00.6970795Z ) 2025-05-07T20:33:00.6971014Z self = 2025-05-07T20:33:00.6971194Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:00.6971198Z 2025-05-07T20:33:00.6971314Z @given( 2025-05-07T20:33:00.6971435Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6971531Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6971647Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6971767Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6971878Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6971949Z ) 2025-05-07T20:33:00.6972204Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6972296Z def test_silu_mul_quant( 2025-05-07T20:33:00.6972415Z self, 2025-05-07T20:33:00.6972487Z T: int, 2025-05-07T20:33:00.6972559Z D: int, 2025-05-07T20:33:00.6972658Z scale_ub: Optional[float], 2025-05-07T20:33:00.6972744Z contiguous: bool, 2025-05-07T20:33:00.6972825Z compiled: bool, 2025-05-07T20:33:00.6972905Z ) -> None: 2025-05-07T20:33:00.6973001Z torch.manual_seed(2025) 2025-05-07T20:33:00.6973070Z 2025-05-07T20:33:00.6973243Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6973317Z 2025-05-07T20:33:00.6973410Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6973580Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6973665Z x = x_sign * x_clamp 2025-05-07T20:33:00.6973742Z x0 = x[:, :D] 2025-05-07T20:33:00.6973821Z x1 = x[:, D:] 2025-05-07T20:33:00.6973889Z 2025-05-07T20:33:00.6973972Z if contiguous: 2025-05-07T20:33:00.6974060Z x0 = x0.contiguous() 2025-05-07T20:33:00.6974150Z x1 = x1.contiguous() 2025-05-07T20:33:00.6974223Z 2025-05-07T20:33:00.6974312Z if scale_ub is not None: 2025-05-07T20:33:00.6974414Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6974553Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6974625Z ) 2025-05-07T20:33:00.6974700Z else: 2025-05-07T20:33:00.6974793Z scale_ub_tensor = None 2025-05-07T20:33:00.6974862Z 2025-05-07T20:33:00.6974999Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6975151Z op = silu_mul_quant 2025-05-07T20:33:00.6975252Z if compiled: 2025-05-07T20:33:00.6975357Z op = torch.compile(op) 2025-05-07T20:33:00.6975459Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6975527Z 2025-05-07T20:33:00.6975615Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6975619Z 2025-05-07T20:33:00.6975713Z moe/activation_test.py:117: 2025-05-07T20:33:00.6975841Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6975945Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6976042Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6976550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:00.6976649Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6977010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6977239Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6977585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6977678Z kernel = self.compile( 2025-05-07T20:33:00.6978155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6978337Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6978459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6978463Z 2025-05-07T20:33:00.6978667Z self = 2025-05-07T20:33:00.6979510Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6980027Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd5054cfac0>} 2025-05-07T20:33:00.6980791Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6981020Z context = 2025-05-07T20:33:00.6981025Z 2025-05-07T20:33:00.6981193Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6981462Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6981567Z module_map=module_map) 2025-05-07T20:33:00.6981729Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6981827Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6981940Z E ^ 2025-05-07T20:33:00.6982302Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6982307Z 2025-05-07T20:33:00.6982726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6982734Z 2025-05-07T20:33:00.6982840Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6983063Z self=, 2025-05-07T20:33:00.6983135Z T=1, 2025-05-07T20:33:00.6983210Z D=7168, 2025-05-07T20:33:00.6983290Z scale_ub=1200.0, 2025-05-07T20:33:00.6983378Z contiguous=False, 2025-05-07T20:33:00.6983462Z compiled=False, 2025-05-07T20:33:00.6983532Z ) 2025-05-07T20:33:00.6983752Z self = 2025-05-07T20:33:00.6983964Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:00.6983974Z 2025-05-07T20:33:00.6984049Z @given( 2025-05-07T20:33:00.6984168Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6984264Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6984378Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6984499Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6984608Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6984681Z ) 2025-05-07T20:33:00.6984927Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6985017Z def test_silu_mul_quant( 2025-05-07T20:33:00.6985095Z self, 2025-05-07T20:33:00.6985169Z T: int, 2025-05-07T20:33:00.6985241Z D: int, 2025-05-07T20:33:00.6985338Z scale_ub: Optional[float], 2025-05-07T20:33:00.6985423Z contiguous: bool, 2025-05-07T20:33:00.6985507Z compiled: bool, 2025-05-07T20:33:00.6985584Z ) -> None: 2025-05-07T20:33:00.6985678Z torch.manual_seed(2025) 2025-05-07T20:33:00.6985746Z 2025-05-07T20:33:00.6985915Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6985985Z 2025-05-07T20:33:00.6986073Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6986202Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6986288Z x = x_sign * x_clamp 2025-05-07T20:33:00.6986367Z x0 = x[:, :D] 2025-05-07T20:33:00.6986442Z x1 = x[:, D:] 2025-05-07T20:33:00.6986510Z 2025-05-07T20:33:00.6986592Z if contiguous: 2025-05-07T20:33:00.6986679Z x0 = x0.contiguous() 2025-05-07T20:33:00.6986809Z x1 = x1.contiguous() 2025-05-07T20:33:00.6986880Z 2025-05-07T20:33:00.6986968Z if scale_ub is not None: 2025-05-07T20:33:00.6987069Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6987208Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6987282Z ) 2025-05-07T20:33:00.6987353Z else: 2025-05-07T20:33:00.6987446Z scale_ub_tensor = None 2025-05-07T20:33:00.6987515Z 2025-05-07T20:33:00.6987645Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6987732Z op = silu_mul_quant 2025-05-07T20:33:00.6987854Z if compiled: 2025-05-07T20:33:00.6987951Z op = torch.compile(op) 2025-05-07T20:33:00.6988053Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6988125Z 2025-05-07T20:33:00.6988214Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6988218Z 2025-05-07T20:33:00.6988313Z moe/activation_test.py:117: 2025-05-07T20:33:00.6988440Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6988541Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6988637Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6989192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6989288Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6989651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6989875Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6990226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6990317Z kernel = self.compile( 2025-05-07T20:33:00.6990707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6990883Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6991008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6991052Z 2025-05-07T20:33:00.6991260Z self = 2025-05-07T20:33:00.6992053Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6992566Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd5050484c0>} 2025-05-07T20:33:00.6993329Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6993522Z context = 2025-05-07T20:33:00.6993527Z 2025-05-07T20:33:00.6993694Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6993964Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6994070Z module_map=module_map) 2025-05-07T20:33:00.6994230Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6994334Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6994413Z E ^ 2025-05-07T20:33:00.6994774Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6994779Z 2025-05-07T20:33:00.6995254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6995259Z 2025-05-07T20:33:00.6995427Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6995656Z self=, 2025-05-07T20:33:00.6995733Z T=4096, 2025-05-07T20:33:00.6995807Z D=7168, 2025-05-07T20:33:00.6995890Z scale_ub=1200.0, 2025-05-07T20:33:00.6995977Z contiguous=False, 2025-05-07T20:33:00.6996059Z compiled=True, 2025-05-07T20:33:00.6996133Z ) 2025-05-07T20:33:00.6996351Z self = 2025-05-07T20:33:00.6996528Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:00.6996577Z 2025-05-07T20:33:00.6996653Z @given( 2025-05-07T20:33:00.6996770Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6996871Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6996983Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6997100Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6997217Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6997286Z ) 2025-05-07T20:33:00.6997531Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6997628Z def test_silu_mul_quant( 2025-05-07T20:33:00.6997740Z self, 2025-05-07T20:33:00.6997814Z T: int, 2025-05-07T20:33:00.6997890Z D: int, 2025-05-07T20:33:00.6997986Z scale_ub: Optional[float], 2025-05-07T20:33:00.6998074Z contiguous: bool, 2025-05-07T20:33:00.6998155Z compiled: bool, 2025-05-07T20:33:00.6998229Z ) -> None: 2025-05-07T20:33:00.6998325Z torch.manual_seed(2025) 2025-05-07T20:33:00.6998394Z 2025-05-07T20:33:00.6998560Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6998637Z 2025-05-07T20:33:00.6998726Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6998848Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6998937Z x = x_sign * x_clamp 2025-05-07T20:33:00.6999013Z x0 = x[:, :D] 2025-05-07T20:33:00.6999090Z x1 = x[:, D:] 2025-05-07T20:33:00.6999204Z 2025-05-07T20:33:00.6999284Z if contiguous: 2025-05-07T20:33:00.6999373Z x0 = x0.contiguous() 2025-05-07T20:33:00.6999464Z x1 = x1.contiguous() 2025-05-07T20:33:00.6999532Z 2025-05-07T20:33:00.6999623Z if scale_ub is not None: 2025-05-07T20:33:00.6999725Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6999857Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6999934Z ) 2025-05-07T20:33:00.7000006Z else: 2025-05-07T20:33:00.7000095Z scale_ub_tensor = None 2025-05-07T20:33:00.7000170Z 2025-05-07T20:33:00.7000304Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.7000391Z op = silu_mul_quant 2025-05-07T20:33:00.7000473Z if compiled: 2025-05-07T20:33:00.7000570Z op = torch.compile(op) 2025-05-07T20:33:00.7000672Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.7000741Z 2025-05-07T20:33:00.7000832Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.7000837Z 2025-05-07T20:33:00.7000933Z moe/activation_test.py:117: 2025-05-07T20:33:00.7001059Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.7001157Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.7001255Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.7001629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.7001720Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.7002223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.7002317Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.7002726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.7002950Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.7003300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.7003396Z kernel = self.compile( 2025-05-07T20:33:00.7003782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.7003961Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.7004122Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.7004126Z 2025-05-07T20:33:00.7004331Z self = 2025-05-07T20:33:00.7005160Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.7005735Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd5050491b0>} 2025-05-07T20:33:00.7006501Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.7006692Z context = 2025-05-07T20:33:00.7006700Z 2025-05-07T20:33:00.7006864Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.7007133Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.7007238Z module_map=module_map) 2025-05-07T20:33:00.7007404Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.7007501Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.7007618Z E ^ 2025-05-07T20:33:00.7007983Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.7008408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.7008522Z Trying example: test_silu_mul_quant( self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True ) -- fails with the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); test source and traceback identical to the example above.
2025-05-07T20:33:00.7021593Z Trying example: test_silu_mul_quant( self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True ) -- same CompilationError.
2025-05-07T20:33:00.7034637Z Trying example: test_silu_mul_quant( self=, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False ) -- this and the remaining failures below occur during input setup, before the kernel is reached; here the test dies at x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0): 2025-05-07T20:33:00.7040024Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.7040187Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:00.7040298Z Trying example: test_silu_mul_quant( self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True ) -- OutOfMemoryError at moe/activation_test.py:95 (torch.clamp): tried to allocate 112.00 MiB with 28.44 MiB free.
2025-05-07T20:33:00.7045768Z Trying example: test_silu_mul_quant( self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False ) -- OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB with 140.44 MiB free.
2025-05-07T20:33:00.7050954Z Trying example: test_silu_mul_quant( self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True ) -- OutOfMemoryError at moe/activation_test.py:95 (torch.clamp): tried to allocate 56.00 MiB with 28.44 MiB free.
2025-05-07T20:33:00.7056952Z Trying example: test_silu_mul_quant( self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False ) -- OutOfMemoryError at moe/activation_test.py:94 (torch.sign): tried to allocate 56.00 MiB with 28.44 MiB free.
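Every OutOfMemoryError above ends with the allocator's own hint: setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True lets the CUDA caching allocator grow existing segments instead of carving out fixed-size blocks, which can reduce the "reserved but unallocated" fragmentation these messages report. The variable must be set before CUDA is initialized in the process; a sketch of one way to apply it, illustrative only and not this workflow's actual configuration:

    # Sketch: opt in to expandable segments before anything touches CUDA.
    import os

    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # imported after the variable is set

    # The first CUDA allocation initializes the allocator with the new setting.
    x = torch.empty(1, device="cuda")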
2025-05-07T20:33:00.7062480Z Trying example: test_silu_mul_quant( self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False ) -- fails with the same CompilationError as above: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.7075133Z Trying example: test_silu_mul_quant( self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False ) -- same CompilationError.
2025-05-07T20:33:00.7091712Z Trying example: test_silu_mul_quant( self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False ) -- same CompilationError.
2025-05-07T20:33:00.7104372Z Trying example: test_silu_mul_quant( self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False ) -- OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 56.00 MiB with 26.44 MiB free.
2025-05-07T20:33:00.7109687Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False ) -- same CompilationError.
2025-05-07T20:33:00.7122263Z Trying example: test_silu_mul_quant( self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False ) -- OutOfMemoryError at moe/activation_test.py:94 (torch.sign): tried to allocate 40.00 MiB with 26.44 MiB free.
2025-05-07T20:33:00.7127653Z Trying example: test_silu_mul_quant( self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False ) -- OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 320.00 MiB with 26.44 MiB free.
2025-05-07T20:33:00.7132876Z Trying example: test_silu_mul_quant( self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False ) -- OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 80.00 MiB with 26.44 MiB free.
2025-05-07T20:33:00.7138108Z Trying example: test_silu_mul_quant( self=, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False ) -- OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 40.00 MiB with 26.44 MiB free.
2025-05-07T20:33:00.7143368Z Trying example: test_silu_mul_quant( self=, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True ) -- OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 112.00 MiB with 26.44 MiB free.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.7148394Z 2025-05-07T20:33:00.7148510Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.7148515Z 2025-05-07T20:33:00.7148620Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.7148846Z self=, 2025-05-07T20:33:00.7148922Z T=2048, 2025-05-07T20:33:00.7149001Z D=5120, 2025-05-07T20:33:00.7149084Z scale_ub=1200.0, 2025-05-07T20:33:00.7149171Z contiguous=False, 2025-05-07T20:33:00.7149259Z compiled=False, 2025-05-07T20:33:00.7149331Z ) 2025-05-07T20:33:00.7149551Z self = 2025-05-07T20:33:00.7149728Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:00.7149796Z 2025-05-07T20:33:00.7149874Z @given( 2025-05-07T20:33:00.7149993Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.7150095Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.7150207Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.7150330Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.7150443Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.7150521Z ) 2025-05-07T20:33:00.7150769Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.7150863Z def test_silu_mul_quant( 2025-05-07T20:33:00.7150937Z self, 2025-05-07T20:33:00.7151014Z T: int, 2025-05-07T20:33:00.7151088Z D: int, 2025-05-07T20:33:00.7151186Z scale_ub: Optional[float], 2025-05-07T20:33:00.7151277Z contiguous: bool, 2025-05-07T20:33:00.7151361Z compiled: bool, 2025-05-07T20:33:00.7151445Z ) -> None: 2025-05-07T20:33:00.7151538Z torch.manual_seed(2025) 2025-05-07T20:33:00.7151611Z 2025-05-07T20:33:00.7151784Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.7153612Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.7153620Z 2025-05-07T20:33:00.7153781Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.7153786Z 2025-05-07T20:33:00.7153891Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.7154118Z self=, 2025-05-07T20:33:00.7154199Z T=4096, 2025-05-07T20:33:00.7154275Z D=7168, 2025-05-07T20:33:00.7154358Z scale_ub=1200.0, 2025-05-07T20:33:00.7154443Z contiguous=True, 2025-05-07T20:33:00.7154526Z compiled=False, 2025-05-07T20:33:00.7154597Z ) 2025-05-07T20:33:00.7154817Z self = 2025-05-07T20:33:00.7155032Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:00.7155036Z 2025-05-07T20:33:00.7155135Z @given( 2025-05-07T20:33:00.7155263Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.7155377Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.7155496Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.7155931Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.7156093Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.7156175Z ) 2025-05-07T20:33:00.7156519Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.7156619Z def test_silu_mul_quant( 2025-05-07T20:33:00.7156694Z self, 2025-05-07T20:33:00.7156769Z T: int, 2025-05-07T20:33:00.7156844Z D: int, 2025-05-07T20:33:00.7156938Z scale_ub: Optional[float], 2025-05-07T20:33:00.7157029Z contiguous: bool, 2025-05-07T20:33:00.7157113Z compiled: bool, 2025-05-07T20:33:00.7157189Z ) -> None: 2025-05-07T20:33:00.7157281Z torch.manual_seed(2025) 2025-05-07T20:33:00.7157356Z 2025-05-07T20:33:00.7157522Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.7159380Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.7159454Z 2025-05-07T20:33:00.7159572Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.7159579Z 2025-05-07T20:33:00.7159684Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.7159907Z self=, 2025-05-07T20:33:00.7159982Z T=16384, 2025-05-07T20:33:00.7160058Z D=7168, 2025-05-07T20:33:00.7160138Z scale_ub=None, 2025-05-07T20:33:00.7160221Z contiguous=False, 2025-05-07T20:33:00.7160308Z compiled=True, 2025-05-07T20:33:00.7160379Z ) 2025-05-07T20:33:00.7160599Z self = 2025-05-07T20:33:00.7160779Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:00.7160786Z 2025-05-07T20:33:00.7160860Z @given( 2025-05-07T20:33:00.7160977Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.7161073Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.7161185Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.7161304Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.7161418Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.7161489Z ) 2025-05-07T20:33:00.7161741Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.7161835Z def test_silu_mul_quant( 2025-05-07T20:33:00.7161908Z self, 2025-05-07T20:33:00.7162048Z T: int, 2025-05-07T20:33:00.7162123Z D: int, 2025-05-07T20:33:00.7162225Z scale_ub: Optional[float], 2025-05-07T20:33:00.7162314Z contiguous: bool, 2025-05-07T20:33:00.7162399Z compiled: bool, 2025-05-07T20:33:00.7162478Z ) -> None: 2025-05-07T20:33:00.7162575Z torch.manual_seed(2025) 2025-05-07T20:33:00.7162645Z 2025-05-07T20:33:00.7162816Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.7164649Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.7164712Z 2025-05-07T20:33:00.7164835Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.7164842Z 2025-05-07T20:33:00.7164945Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.7165205Z self=, 2025-05-07T20:33:00.7165284Z T=4096, 2025-05-07T20:33:00.7165358Z D=7168, 2025-05-07T20:33:00.7165439Z scale_ub=None, 2025-05-07T20:33:00.7165528Z contiguous=True, 2025-05-07T20:33:00.7165610Z compiled=False, 2025-05-07T20:33:00.7165685Z ) 2025-05-07T20:33:00.7165907Z self = 2025-05-07T20:33:00.7166078Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.7166082Z 2025-05-07T20:33:00.7166159Z @given( 2025-05-07T20:33:00.7166276Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.7166375Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.7166489Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.7166605Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.7166760Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.7166834Z ) 2025-05-07T20:33:00.7167081Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.7167178Z def test_silu_mul_quant( 2025-05-07T20:33:00.7167252Z self, 2025-05-07T20:33:00.7167327Z T: int, 2025-05-07T20:33:00.7167407Z D: int, 2025-05-07T20:33:00.7167507Z scale_ub: Optional[float], 2025-05-07T20:33:00.7167594Z contiguous: bool, 2025-05-07T20:33:00.7167681Z compiled: bool, 2025-05-07T20:33:00.7167755Z ) -> None: 2025-05-07T20:33:00.7167849Z torch.manual_seed(2025) 2025-05-07T20:33:00.7167922Z 2025-05-07T20:33:00.7168088Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.7169925Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.7169936Z 2025-05-07T20:33:00.7170053Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.7170057Z 2025-05-07T20:33:00.7170161Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.7170384Z self=, 2025-05-07T20:33:00.7170459Z T=16384, 2025-05-07T20:33:00.7170534Z D=7168, 2025-05-07T20:33:00.7170657Z scale_ub=None, 2025-05-07T20:33:00.7170741Z contiguous=True, 2025-05-07T20:33:00.7170825Z compiled=False, 2025-05-07T20:33:00.7170894Z ) 2025-05-07T20:33:00.7171116Z self = 2025-05-07T20:33:00.7171295Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.7171300Z 2025-05-07T20:33:00.7171374Z @given( 2025-05-07T20:33:00.7171491Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.7171587Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.7171700Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.7171860Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.7171972Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.7172043Z ) 2025-05-07T20:33:00.7172293Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.7172385Z def test_silu_mul_quant( 2025-05-07T20:33:00.7172461Z self, 2025-05-07T20:33:00.7172538Z T: int, 2025-05-07T20:33:00.7172611Z D: int, 2025-05-07T20:33:00.7172707Z scale_ub: Optional[float], 2025-05-07T20:33:00.7172800Z contiguous: bool, 2025-05-07T20:33:00.7172923Z compiled: bool, 2025-05-07T20:33:00.7173004Z ) -> None: 2025-05-07T20:33:00.7173100Z torch.manual_seed(2025) 2025-05-07T20:33:00.7173171Z 2025-05-07T20:33:00.7173342Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.7175229Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.7175238Z 2025-05-07T20:33:00.7175395Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.7175400Z 2025-05-07T20:33:00.7175503Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.7175726Z self=, 2025-05-07T20:33:00.7175805Z T=16384, 2025-05-07T20:33:00.7175879Z D=7168, 2025-05-07T20:33:00.7175960Z scale_ub=1200.0, 2025-05-07T20:33:00.7176048Z contiguous=True, 2025-05-07T20:33:00.7176135Z compiled=False, 2025-05-07T20:33:00.7176209Z ) 2025-05-07T20:33:00.7176427Z self = 2025-05-07T20:33:00.7176603Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:00.7176608Z 2025-05-07T20:33:00.7176684Z @given( 2025-05-07T20:33:00.7176799Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.7176896Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.7177013Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.7177130Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.7177246Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.7177320Z ) 2025-05-07T20:33:00.7177568Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.7177664Z def test_silu_mul_quant( 2025-05-07T20:33:00.7177736Z self, 2025-05-07T20:33:00.7177813Z T: int, 2025-05-07T20:33:00.7177889Z D: int, 2025-05-07T20:33:00.7177985Z scale_ub: Optional[float], 2025-05-07T20:33:00.7178167Z contiguous: bool, 2025-05-07T20:33:00.7178254Z compiled: bool, 2025-05-07T20:33:00.7178329Z ) -> None: 2025-05-07T20:33:00.7178423Z torch.manual_seed(2025) 2025-05-07T20:33:00.7178495Z 2025-05-07T20:33:00.7178709Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.7180553Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
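Every failure above is the initial torch.randn of a [T, 2 * D] bfloat16 tensor, and the requested sizes match T x 2D x 2 bytes exactly (e.g. T=4096, D=5120 gives 4096 x 10240 x 2 B = 80 MiB). Each message suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A minimal sketch of applying that hint, assuming a fresh process; the variable is read when CUDA is first initialized, so it must be set before the first GPU allocation:

    import os

    # Allocator hint from the error message above; must be set before torch
    # initializes CUDA, i.e. before any GPU allocation in this process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    # The 80 MiB case from the log: 4096 x (2 * 5120) bf16 elements.
    x = torch.randn([4096, 2 * 5120], device="cuda", dtype=torch.bfloat16)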
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)

    (@given/@settings decorators as above)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fd504c45ea0>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
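This CompilationError is an architecture mismatch rather than a bug in the kernel: Triton's fp8e4nv (FP8 E4M3) type appears to require compute capability (8, 9) or newer, while the A10G on this linux.g5.4xlarge runner reports (8, 6). A minimal sketch of a capability guard, assuming that threshold; the helper and decorator names are illustrative, not FBGEMM's actual gating:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Assumption: fp8e4nv conversions need sm_89+ on this Triton build.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    skip_if_no_fp8 = unittest.skipIf(
        not _supports_fp8e4nv(),
        "fp8e4nv not supported on this GPU architecture",
    )

Applied to test_silu_mul_quant, such a guard would skip rather than fail on pre-Ada GPUs like this one.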
Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)

    (test source as above)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
    (same Triton compile path as above: jit.py:330 -> jit.py:623 -> compiler.py:273)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
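Note that the compiled=True variant fails identically, just with torch._dynamo's eval_frame.py in the stack: torch.compile wraps the same op, which still launches the same Triton kernel, so it cannot mask the fp8e4nv requirement. For context, a minimal eager sketch of the pattern under test, assuming silu_mul_quant computes silu(x0) * x1 followed by FP8 quantization (an assumption; the kernel source is not in this log, and the quantization step is omitted here):

    import torch
    import torch.nn.functional as F

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # Eager reference of the fused op's math: SiLU(x0) * x1, elementwise.
        return F.silu(x0) * x1

    # Compiling the reference changes the execution path, not the hardware
    # requirements of any FP8 kernel it would lower to.
    compiled_ref = torch.compile(silu_mul_ref)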
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)

(failed the same way at moe/activation_test.py:95, allocating 20.00 MiB with 4.44 MiB free and 3.87 MiB reserved but unallocated)
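The failures have now moved from line 92 to line 95 (the torch.clamp temporary), and the free memory reported has shrunk from 26.44 MiB to 4.44 MiB while PyTorch's allocated memory grew from 21.73 to 21.77 GiB, which suggests allocations accumulate across Hypothesis examples within this single process. A sketch of per-example cleanup under that assumption; the helper is hypothetical and not part of the test file shown above:

    import gc

    import torch

    def release_cuda_memory() -> None:
        gc.collect()              # drop dead tensors still held via Python refs
        torch.cuda.empty_cache()  # return cached allocator blocks to the driver
        torch.cuda.synchronize()  # ensure frees complete before the next example

Called from the test's setUp/tearDown, this would bound growth from cached blocks, though it cannot help if live tensors are genuinely retained between examples.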
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 (3 occurrences)
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
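To replay a specific failing case deterministically before the randomized draws, Hypothesis's example decorator pins inputs. A sketch mirroring the @given block from the log; the test name and placeholder body are illustrative, not the real test:

    from typing import Optional

    from hypothesis import example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    # Pin the case that hit the Triton CompilationError above.
    @example(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
    @settings(deadline=None)
    def test_silu_mul_quant_repro(
        T: int, D: int, scale_ub: Optional[float], contiguous: bool, compiled: bool
    ) -> None:
        assert T > 0 and D > 0  # placeholder body; the real test is shown above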
See " 2025-05-07T20:33:00.7234465Z 2025-05-07T20:33:00.7234678Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:00.7234844Z ================= 1 failed, 1 deselected, 3 warnings in 17.43s ================= 2025-05-07T20:33:02.2625196Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:02.3240090Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:33:02.3240806Z 2025-05-07T20:33:02.3241328Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:33:02.3242967Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:33:02.3244145Z 2025-05-07T20:33:02.3244157Z 2025-05-07T20:33:02.3244168Z 2025-05-07T20:33:02.3261595Z ##[error]Process completed with exit code 1. 2025-05-07T20:33:02.3342463Z Post job cleanup. 2025-05-07T20:33:02.4351341Z [command]/usr/bin/git version 2025-05-07T20:33:02.4391472Z git version 2.47.1 2025-05-07T20:33:02.4430217Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/19d6b077-0aa3-409a-923a-d25e4232a9ba/.gitconfig' 2025-05-07T20:33:02.4440933Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/19d6b077-0aa3-409a-923a-d25e4232a9ba' before making global git config changes 2025-05-07T20:33:02.4442322Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:33:02.4456679Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:33:02.4500041Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:33:02.4535464Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:33:02.4871209Z Entering 'external/asmjit' 2025-05-07T20:33:02.4937313Z Entering 'external/composable_kernel' 2025-05-07T20:33:02.5011871Z Entering 'external/cpuinfo' 2025-05-07T20:33:02.5079113Z Entering 'external/cutlass' 2025-05-07T20:33:02.5152395Z Entering 'external/googletest' 2025-05-07T20:33:02.5219127Z Entering 'external/hipify_torch' 2025-05-07T20:33:02.5285881Z Entering 'external/json' 2025-05-07T20:33:02.5371279Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:33:02.5396959Z http.https://github.com/.extraheader 2025-05-07T20:33:02.5408667Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:33:02.5439913Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:33:02.5768668Z Entering 'external/asmjit' 2025-05-07T20:33:02.5810077Z http.https://github.com/.extraheader 2025-05-07T20:33:02.5852974Z Entering 'external/composable_kernel' 2025-05-07T20:33:02.5896354Z http.https://github.com/.extraheader 2025-05-07T20:33:02.5948393Z Entering 'external/cpuinfo' 2025-05-07T20:33:02.5990602Z http.https://github.com/.extraheader 2025-05-07T20:33:02.6033418Z Entering 'external/cutlass' 2025-05-07T20:33:02.6077460Z http.https://github.com/.extraheader 2025-05-07T20:33:02.6128565Z 
2025-05-07T20:33:02.6128565Z Entering 'external/googletest'
2025-05-07T20:33:02.6172045Z http.https://github.com/.extraheader
2025-05-07T20:33:02.6215365Z Entering 'external/hipify_torch'
2025-05-07T20:33:02.6259270Z http.https://github.com/.extraheader
2025-05-07T20:33:02.6301682Z Entering 'external/json'
2025-05-07T20:33:02.6344543Z http.https://github.com/.extraheader
2025-05-07T20:33:02.6501325Z A job completed hook has been configured by the self-hosted runner administrator
2025-05-07T20:33:02.6532487Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh'
2025-05-07T20:33:02.6543019Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:33:02.6543380Z ##[endgroup]
2025-05-07T20:33:02.6643709Z [!ALERT!] Swap in detected! [!ALERT!]
2025-05-07T20:33:13.4143563Z [!ALERT!] Swap out detected [!ALERT!]
2025-05-07T20:33:29.8221116Z Cleaning up orphan processes
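For reference, the retry the harness ran above (`conda run python -m pytest ... --lf --last-failed-no-failures none`) can be reproduced locally through pytest's Python entry point; a sketch, assuming it is run from the test directory:

    import sys

    import pytest

    exit_code = pytest.main([
        "-v", "-rsx", "-s",
        "-W", "ignore::pytest.PytestCollectionWarning",
        "--lf",                               # rerun only the last failed tests
        "--last-failed-no-failures", "none",  # run nothing if none failed before
        "./moe/activation_test.py",
    ])
    sys.exit(exit_code)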